This document is compiled from an Rmarkdown file which contains all code necessary to reproduce the analysis for the accompanying manuscript “Copy-number signatures and mutational processes in ovarian carcinoma”. Details on how to compile the document can be found in the repository README: https://bitbucket.org/britroc/cnsignatures. Much of the code for signature analysis can be found in the main_functions.R and helper_functions.R files in the base directory of the repositoy. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.
The analysis in this document starts from preprocessed copy-number profiles for all datasets due to restrictions on the underlying raw data. Details on how to reproduce the results from raw data can be found in the main manuscript.
A total of 300 samples were shallow whole-genome and tagged amplicon sequenced. Raw data were processed as outlined in the main manuscript. After inspection of the TP53 mutation status and relative copy-number profiles, 47 were excluded from downstream analysis for the following reasons:
##
## low purity mislabelled not HGSOC no TP53
## 24 7 3 13
Samples were then processed using ABCEL (to be released) to generate absolute copy-number profiles. All samples were manually inspected and the copy-number fit adjusted if necessary (based on evidence from other samples from the same patient).
Following absolute copy-number fitting, the samples were rated using a 1-3 star system.
Star rating per sample:
##
## 1 2 3
## 54 52 147
Maximum star rating per case:
| star_rating | n |
|---|---|
| 1 | 15 |
| 2 | 26 |
| 3 | 91 |
To test the performance of our absolute copy-number fiting from sWGS, we compared ploidy and purity estimates across a subset of samples with matched deepWGS. The estimates for deepWGS samples were extracted from Battenberg plots for those samples.
# Read in ploidy and purity estimates obtained from Battenberg plots.
deep_metrics <- read.csv("data/britroc_deepWGS_ploidy_purity_49.csv", header=T, as.is = T)
deep_metrics$purity <- deep_metrics$purity/100
deep_metrics <- deep_metrics %>%
mutate(name= sub("[0-9]+_tumor_","",x=wgs_ID)) %>%
dplyr::select(name,ploidy, purity) %>%
tidyr::gather("metric", "deep", -name)
# Get ploidy for sWGS samples
CN <- readRDS("data/britroc_absolute_copynumber.rds")
Biobase::pData(CN)$ploidy <- getPloidy(CN) %>%
.$out
shallow_metrics <- Biobase::pData(CN) %>%
dplyr::select(name, ploidy, purity) %>%
filter(name %in% deep_metrics$name) %>%
tidyr::gather("metric", "shallow", -name)
# Annotate sWGS samples with star_rating
samp_annotation_all <- read.csv("data/britroc_sample_data.csv", as.is=T)
samp_annot <- samp_annotation_all %>%
filter(IM.JBLAB_ID %in% shallow_metrics$name & Failed !="Y") %>%
dplyr::select(IM.JBLAB_ID , star_rating)
# Combine estimates from deep and shallow WGS methods and retain 3-star samples
metrics <- inner_join(deep_metrics, shallow_metrics, by = c("name", "metric")) %>%
inner_join(samp_annot,c("name" = "IM.JBLAB_ID" ) ) %>%
filter(star_rating ==3)
# Perform correlation test of ploidy, purity in Britroc samples with dWGS and sWGS data
correlations <- metrics %>%
group_by(metric) %>%
do(broom::tidy(cor.test(.$deep,.$shallow)))
# Create dataframes for annotation and generate plot for 3-star samples.
correlations$metric <- factor(correlations$metric, levels=c("ploidy", "purity"))
annot_text <- correlations %>%
ungroup() %>%
dplyr::select(metric, estimate, p.value) %>%
mutate(x=c(2,0.4),y.cor=c(4.5,0.9), y.pval= c(4.375, 0.87)) %>%
group_by(metric) %>%
mutate(label.cor=paste0("R^2== ",round(estimate,2)), label.pval = paste0("P== ", format(signif(p.value,1),digits=1, scientific = T)) )
cor_plot <- metrics %>%
ggplot(aes(x=deep, y=shallow)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~metric, scales = "free") +
labs(x= "Deep WGS", y= "sWGS") +
theme_bw()+
theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
strip.text.x = element_text(size = 7), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())
cor_plot +
geom_text(data=annot_text, aes(label=label.cor, x=x,y=y.cor), parse = T,inherit.aes=FALSE, size=2) +
geom_text(data=annot_text, aes(label=label.pval, x=x,y=y.pval), parse = T,inherit.aes=FALSE, size=2)
These plot show high concordance between ploidy and purity estimated using sWGS and deep WGS. Only two samples showed a significantly different ploidy estimate between sWGS and deepWGS. On closer inspection of the deepWGS results both samples were deemed whole-genome duplication uncertain, meaning both ploidy solutions could have been likely.
For signature analysis we opted to use cases with 3 star copy-number profiles (total=91).
In order to perform copy-number signature identification we summarised each copy-number profile using a number of different feature distributions:
#read in absolute CN post ABCEL processing
all_CN<-readRDS("data/britroc_absolute_copynumber.rds")
all_CN<-all_CN[,colnames(all_CN)%in%samp_annotation[!samp_annotation$Failed=="Y","IM.JBLAB_ID"]]
#extract 3 star CN from each case
ids<-result %>% filter(star_rating==3)
ids<-ids$IM.JBLAB_ID
hq_CN<-all_CN[,colnames(all_CN)%in%ids]
ids<-result %>% filter(star_rating==2)
ids<-ids$IM.JBLAB_ID
lq_CN<-all_CN[,colnames(all_CN)%in%ids]
#estract copy-number features
CN_features<-extractCopynumberFeatures(hq_CN,cores=num_cores)
Following copy-number feature extraction we applied mixture modelling to breakdown each feature distribution into mixtures of Gaussian or mixtures of Poisson distributions using the flexmix package.
seed=77777
min_prior=0.001
model_selection="BIC"
nrep=1
niter=1000
dat<-as.numeric(CN_features[["segsize"]][,2])
segsize_mm<-fitComponent(dat,seed=seed,model_selection=model_selection,
min_prior=min_prior,niter=niter,nrep=nrep,min_comp=10,max_comp=10)
dat<-as.numeric(CN_features[["bp10MB"]][,2])
bp10MB_mm<-fitComponent(dat,dist="pois",seed=seed,model_selection=model_selection,
min_prior=min_prior,niter=niter,nrep=nrep,min_comp=3,max_comp=3)
dat<-as.numeric(CN_features[["osCN"]][,2])
osCN_mm<-fitComponent(dat,dist="pois",seed=seed,model_selection=model_selection,
min_prior=min_prior,niter=niter,nrep=nrep,min_comp=3,max_comp=3)
dat<-as.numeric(CN_features[["bpchrarm"]][,2])
bpchrarm_mm<-fitComponent(dat,dist="pois",seed=seed,model_selection=model_selection,
min_prior=min_prior,niter=niter,nrep=3,min_comp=2,max_comp=5)
dat<-as.numeric(CN_features[["changepoint"]][,2])
changepoint_mm<-fitComponent(dat,seed=seed,model_selection=model_selection,
min_prior=min_prior,niter=niter,nrep=nrep,min_comp=7,max_comp=7)
dat<-as.numeric(CN_features[["copynumber"]][,2])
copynumber_mm<-fitComponent(dat,seed=seed,model_selection=model_selection,
nrep=nrep,min_comp=2,max_comp=10,min_prior=0.005,niter=2000)
CN_components<-list(segsize=segsize_mm,bp10MB=bp10MB_mm,osCN=osCN_mm,changepoint=changepoint_mm,copynumber=copynumber_mm,bpchrarm=bpchrarm_mm)
Once the components were identified we then generated a sample-by-component matrix representing the sum of posterior probabilities of each copy-number event being assigned to each component.
britroc_sample_component_matrix<-generateSampleByComponentMatrix(CN_features,CN_components,cores=1,subcores=num_cores)
NMF::aheatmap(britroc_sample_component_matrix,fontsize = 7,Rowv=FALSE,Colv=FALSE,legend = T,breaks=c(seq(0,199,2),500),main="Component x Sample matrix")
We used the NMF package to factorise the sample-by-component matrix into a signature-by-sample matrix and component by signature matrix.
Previous studies suggest that there should be a minimum of roughly 4 detectable signatures representing mutagenic processes which give rise to distinct patterns of copy-number change: breakage-fusion-bridge, duplicator phenotype, chromothripsis, and amplifier phenotype. Our ability to detect these depends on whether each pattern is sufficiently represented in our data. As NMF demands the upper bound on the number of signatures be << than both the sample and component number, we chose a signature search interval of [3,12]. We ran the matrix factorisation 1000 times for each number of signatures, each with a different random seed, and compared the cophentic, dispersion, silhouette, and sparseness scores for the signature-feature matrix (basis), patient-signature matrix (coefficients) and a consensus matrix of patient-by-patient across the 1000 runs. In addition we performed 1000 random shuffles of the input matrix to get a null estimate of each of the scores.
nmfalg<-"brunet"
seed<-77777
estim.r <- NMF::nmfEstimateRank(t(britroc_sample_component_matrix), 3:12,seed = seed,nrun=1000,
verbose=F,method=nmfalg,.opt = paste0("p",num_cores))
V.random <- randomize(t(britroc_sample_component_matrix))
estim.r.random <- NMF::nmfEstimateRank(V.random, 3:12, seed =seed,nrun=1000,
verbose=F,method=nmfalg,.opt = paste0("p",num_cores))
p<-plot(estim.r,estim.r.random,
what = c("cophenetic", "dispersion","sparseness", "silhouette"),xname="Observed",yname="Randomised",main="")+
theme(axis.text=element_text(size=5),axis.title=element_text(size=5),
strip.text.x = element_text(size = 5),
strip.text.y = element_text(size = 5),
legend.text = element_text(size = 5),
legend.title = element_text(size = 7))
g<-ggplotGrob(p)
g[["grobs"]][[2]]$children[[4]]$size[[1]]<-0.5
g[["grobs"]][[3]]$children[[4]]$size[[1]]<-0.5
g[["grobs"]][[4]]$children[[4]]$size[[1]]<-0.5
g[["grobs"]][[5]]$children[[4]]$size[[1]]<-0.5
grid::grid.newpage()
grid::grid.draw(g)
The plots above suggest that the optimal number of signatures is 7. A signature number higher than this would force the sparseness in the signature by feature matrix (basis) to be greater than that which could be obtained by randomly shuffling the input matrix.
Derived matrices as heatmaps:
nsig<-7
sigs<-NMF::nmf(t(britroc_sample_component_matrix),nsig,seed=seed,nrun=1000,method=nmfalg,.opt = paste0("p",num_cores))
coefmap(sigs,Colv="consensus",tracks=c("basis:"), main="Patient x Signature matrix")
basismap(sigs,Rowv=NA,main="Signature x Component matrix")
To test the robustness of the signature identification approach we applied the same procedure to two independent datasets: deep whole-genome sequencing (approximately 40x) of high-grade serous ovarain cancer samples processed as part of the Pan-Cancer Analysis of Whole Genomes project (PCAWG); and SNParray profiling of HGSOC cases as part of TCGA.
We took the ABSOLUTE copy-number profiles of 112 cases and subjected them to the same feature extraction and NMF procedure as above:
pcawg_CN_features<-readRDS("data/pcawg_CN_features.rds")
pcawg_sample_component_matrix<-generateSampleByComponentMatrix(pcawg_CN_features,CN_components)
NMF::aheatmap(pcawg_sample_component_matrix,Rowv=NULL, main="Component x Sample matrix")
pcawg_ids<-rownames(pcawg_sample_component_matrix)
pcawg_sigs<-NMF::nmf(t(pcawg_sample_component_matrix),nsig,seed=seed,nrun=1000,method=nmfalg,.opt = "p16")
coefmap(pcawg_sigs,Colv="consensus",tracks=c("consensus:"), main="Sample x Signature matrix")#annCol=annCol,
basismap(pcawg_sigs,Rowv=NA,main="Signature x Component matrix")
We used ABSOLLUTE copy-number calls for 415 samples and applied the same signature identification procedure as above:
tcga_CN_features<-readRDS("data/tcga_CN_features.rds")
tcga_sample_component_matrix<-generateSampleByComponentMatrix(tcga_CN_features,CN_components)
NMF::aheatmap(tcga_sample_component_matrix,Rowv=NULL, main="Sample x Component matrix")
tcga_ids<-rownames(tcga_sample_component_matrix)
tcga_sigs<-NMF::nmf(t(tcga_sample_component_matrix),nsig,seed=seed,nrun=1000,method=nmfalg,.opt = "p16")
coefmap(tcga_sigs,Colv="consensus",tracks=c("consensus:"), main="Sample x Signature matrix")#annCol=annCol,
basismap(tcga_sigs,Rowv=NA,main="Signature x Component matrix")
We calculated the Spearman rank correlation between feature vectors for each signature to assess how similar the signatures were across the BRITROC, PCAWG and TCGA cohorts:
reord_components<-c(11:13,24:31,17:23,32:36,14:16,1:10)
#britroc feat_sig matrix
feat_sig_mat<-basis(sigs)
reord_britroc<-as.integer(c(2,6,5,4,7,3,1))
names(reord_britroc)<-paste0("s",1:7)
feat_sig_mat<-feat_sig_mat[,reord_britroc]
colnames(feat_sig_mat)<-paste0("s",1:nsig)
sig_feat_mat<-t(feat_sig_mat)
#pcawg feat_sig matrix
feat_sig_mat_pcawg<-basis(pcawg_sigs)[,]
sig_feat_mat_pcawg<-t(feat_sig_mat_pcawg)
colnames(feat_sig_mat_pcawg)<-paste0("s",1:nsig)
#tcga feat_sig matrix
feat_sig_mat_tcga<-basis(tcga_sigs)[,]
sig_feat_mat_tcga<-t(feat_sig_mat_tcga)
colnames(feat_sig_mat_tcga)<-paste0("s",1:nsig)
#determine matching signatures and their correlation
reord_pcawg<-apply(feat_sig_mat,2,function(x){which.max(apply(feat_sig_mat_pcawg,2,cor,x,method="s"))})
sigcor_pcawg<-apply(feat_sig_mat,2,function(x){max(apply(feat_sig_mat_pcawg,2,cor,x,method="s"))})
reord_tcga<-apply(feat_sig_mat,2,function(x){which.max(apply(feat_sig_mat_tcga,2,cor,x,method="s"))})
sigcor_tcga<-apply(feat_sig_mat,2,function(x){max(apply(feat_sig_mat_tcga,2,cor,x,method="s"))})
#plot the feat_sig matrices side by side
par(mfrow=c(1,3))
basismap(sigs,Rowv=NA,Colv=reord_britroc,main="BritROC",tracks=NA)
basismap(pcawg_sigs,Rowv=NA,Colv=reord_pcawg,main="PCAWG",tracks=NA)
basismap(tcga_sigs,Rowv=NA,Colv=reord_tcga,main="TCGA",tracks=NA)
#reorder feat_sig matrix
sig_feat_mat_pcawg<-data.frame(sig_feat_mat_pcawg[reord_pcawg,],stringsAsFactors = F)
rownames(sig_feat_mat_pcawg)<-paste0("s",1:nsig)
feat_sig_mat_pcawg<-t(sig_feat_mat_pcawg)
sig_feat_mat_tcga<-data.frame(sig_feat_mat_tcga[reord_tcga,],stringsAsFactors = F)
rownames(sig_feat_mat_tcga)<-paste0("s",1:nsig)
feat_sig_mat_tcga<-t(sig_feat_mat_tcga)
b_vs_p<-c()
for(i in 1:nsig)
{
b_vs_p<-c(b_vs_p,cor.test(feat_sig_mat[,i],feat_sig_mat_pcawg[,i])$p.value)
}
b_vs_t<-c()
for(i in 1:nsig)
{
b_vs_t<-c(b_vs_t,cor.test(feat_sig_mat[,i],feat_sig_mat_tcga[,i])$p.value)
}
b_vs_p<-p.adjust(b_vs_p,method="BH")
b_vs_t<-p.adjust(b_vs_t,method="BH")
sig_comp_out<-rbind(
format(round(sigcor_pcawg,2),scientific=F),
signif(b_vs_p,1),
format(round(sigcor_tcga,2),scientific=F),
signif(b_vs_t,1))
rownames(sig_comp_out)<-c("ICGC correlation","ICGC p-value","TCGA correlation","TCGA p-value")
colnames(sig_comp_out)<-1:nsig
knitr::kable(sig_comp_out)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|
| ICGC correlation | 0.78 | 0.91 | 0.95 | 0.88 | 0.93 | 0.73 | 0.84 |
| ICGC p-value | 1e-43 | 6e-15 | 3e-21 | 2e-15 | 5e-26 | 9e-05 | 2e-09 |
| TCGA correlation | 0.73 | 0.84 | 0.89 | 0.93 | 0.71 | 0.56 | 0.92 |
| TCGA p-value | 5e-41 | 8e-14 | 4e-15 | 3e-11 | 1e-08 | 3e-08 | 1e-21 |
The heatmap and signature correlations above show that the signatures derived from the deep whole-genome sequencing are similar to those derived using sWGS.
Signature exposure matrices were normalised to sum to one and exposures less than 0.01 were considered 0.
sig_thresh<-0.01
sig_pat_mat_hq<-scoef(sigs)
sig_pat_mat_hq<-sig_pat_mat_hq[reord_britroc,]
rownames(sig_pat_mat_hq)<-paste0("s",1:nsig)
sig_pat_mat_hq<-normaliseMatrix(sig_pat_mat_hq,sig_thresh)
sig_pat_mat_pcawg<-scoef(pcawg_sigs)
rownames(sig_pat_mat_pcawg)<-paste0("s",1:nsig)
sig_pat_mat_pcawg<-sig_pat_mat_pcawg[reord_pcawg,]
rownames(sig_pat_mat_pcawg)<-paste0("s",1:nsig)
sig_pat_mat_pcawg<-normaliseMatrix(sig_pat_mat_pcawg,sig_thresh)
sig_pat_mat_tcga<-scoef(tcga_sigs)
rownames(sig_pat_mat_tcga)<-paste0("s",1:nsig)
sig_pat_mat_tcga<-sig_pat_mat_tcga[reord_tcga,]
rownames(sig_pat_mat_tcga)<-paste0("s",1:nsig)
sig_pat_mat_tcga<-normaliseMatrix(sig_pat_mat_tcga,sig_thresh)
We assigned all 2-star copy-number profile samples signature proportions using the LCD function from the YAPSA package.
#extract features for lower quality samples
lowquality_britroc_CN_features<-extractCopynumberFeatures(lq_CN)
lowquality_britroc_sample_component_matrix<-generateSampleByComponentMatrix(lowquality_britroc_CN_features,CN_components)
britroc_lq_ids<-rownames(lowquality_britroc_sample_component_matrix)
#assign signatures
sig_pat_mat_lq<-YAPSA::LCD(t(lowquality_britroc_sample_component_matrix),feat_sig_mat)
rownames(sig_pat_mat_lq)<-paste0("s",1:nsig)
sig_pat_mat_lq<-normaliseMatrix(sig_pat_mat_lq,sig_thresh)
#assign signatures to remaining 2 and 3 star samples (for cases with multiple samples)
remain_samp<-samp_annotation[(!samp_annotation$IM.JBLAB_ID%in%colnames(cbind(sig_pat_mat_hq,sig_pat_mat_lq)))&(!samp_annotation$star_rating==1),]
remain_CN<-all_CN[,colnames(all_CN)%in%remain_samp$IM.JBLAB_ID]
remain_britroc_CN_features<-extractCopynumberFeatures(remain_CN)
remain_britroc_sample_component_matrix<-generateSampleByComponentMatrix(remain_britroc_CN_features,CN_components)
britroc_remain_ids<-rownames(remain_britroc_sample_component_matrix)
sig_pat_mat_remain<-YAPSA::LCD(t(remain_britroc_sample_component_matrix),feat_sig_mat)
rownames(sig_pat_mat_remain)<-paste0("s",1:nsig)
sig_pat_mat_remain<-normaliseMatrix(sig_pat_mat_remain,sig_thresh)
sig_pat_mat_britroc<-cbind(sig_pat_mat_hq,sig_pat_mat_lq)
sig_pat_mat_britroc_all<-cbind(sig_pat_mat_britroc,sig_pat_mat_remain)
To adopt the same visualisation that has been used for SNV signatures, we generated a histogram for each signature, containing the relative weighting of each of the components, colour coded by the feature distribution they come from.
| mean | sd | component | |
|---|---|---|---|
| segsize1 | 4.268619e+05 | 1.869249e+05 | segsize1 |
| segsize2 | 1.081858e+06 | 4.073021e+05 | segsize2 |
| segsize3 | 2.233030e+06 | 7.490920e+05 | segsize3 |
| segsize4 | 4.303321e+06 | 1.115832e+06 | segsize4 |
| segsize5 | 7.304340e+06 | 1.644307e+06 | segsize5 |
| segsize6 | 1.047933e+07 | 2.972414e+06 | segsize6 |
| segsize7 | 1.641912e+07 | 5.226151e+06 | segsize7 |
| segsize8 | 2.950832e+07 | 9.703791e+06 | segsize8 |
| segsize9 | 5.863890e+07 | 2.049900e+07 | segsize9 |
| segsize10 | 1.183110e+08 | 4.521059e+07 | segsize10 |
| copynumber1 | 9.979984e-01 | 1.028012e-01 | copynumber1 |
| copynumber2 | 1.981355e+00 | 1.214921e-01 | copynumber2 |
| copynumber3 | 2.561682e+00 | 1.002305e+00 | copynumber3 |
| copynumber4 | 2.991089e+00 | 1.440896e-01 | copynumber4 |
| copynumber5 | 3.970519e+00 | 1.714569e-01 | copynumber5 |
| copynumber6 | 4.271647e+00 | 1.584293e+00 | copynumber6 |
| copynumber7 | 8.392609e+00 | 3.501494e+00 | copynumber7 |
| copynumber8 | 3.086723e+01 | 2.315811e+01 | copynumber8 |
| changepoint1 | 4.945265e-01 | 1.564534e-01 | changepoint1 |
| changepoint2 | 9.683467e-01 | 1.562940e-01 | changepoint2 |
| changepoint3 | 1.178598e+00 | 2.266991e-01 | changepoint3 |
| changepoint4 | 1.822407e+00 | 3.990779e-01 | changepoint4 |
| changepoint5 | 3.006858e+00 | 1.039581e+00 | changepoint5 |
| changepoint6 | 7.314942e+00 | 3.459220e+00 | changepoint6 |
| changepoint7 | 2.873467e+01 | 2.205516e+01 | changepoint7 |
| bpchrarm1 | 6.154320e-02 | NA | bpchrarm1 |
| bpchrarm2 | 2.622567e+00 | NA | bpchrarm2 |
| bpchrarm3 | 7.777202e+00 | NA | bpchrarm3 |
| bpchrarm4 | 1.754649e+01 | NA | bpchrarm4 |
| bpchrarm5 | 3.353068e+01 | NA | bpchrarm5 |
| bp10MB1 | 6.470000e-05 | NA | bp10MB1 |
| bp10MB2 | 1.255291e+00 | NA | bp10MB2 |
| bp10MB3 | 4.074583e+00 | NA | bp10MB3 |
| osCN1 | 3.394844e-01 | NA | osCN1 |
| osCN2 | 2.627145e+00 | NA | osCN2 |
| osCN3 | 9.587145e+00 | NA | osCN3 |
b<-reshape2::melt(sig_pat_mat_britroc)
b$cohort<-"BriTROC"
p<-reshape2::melt(sig_pat_mat_pcawg)
p$cohort<-"PCAWG"
t<-reshape2::melt(sig_pat_mat_tcga)
t$cohort<-"TCGA"
pdat<-rbind(b,p,t)
colnames(pdat)<-c("Signature","Patient","Exposure","Cohort")
pdat$Signature<-plyr::revalue(pdat$Signature,
c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
signif<-ggpubr::compare_means(Exposure ~ Cohort,data=pdat,group.by = c("Signature"),
method="wilcox.test",p.adjust.method="BH")
signif<-signif %>% filter(p.signif!="ns") %>% group_by(Signature) %>% filter(p.format==min(p.format))
signif$Exposure<-1
knitr::kable(signif)
| Signature | .y. | group1 | group2 | p | p.adj | p.format | p.signif | method | Exposure |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Exposure | BriTROC | TCGA | 0.0051493 | 0.0180224 | 0.0051 | ** | Wilcoxon | 1 |
| 3 | Exposure | BriTROC | TCGA | 0.0033635 | 0.0141267 | 0.0034 | ** | Wilcoxon | 1 |
| 4 | Exposure | BriTROC | PCAWG | 0.0000003 | 0.0000028 | 2.7e-07 | **** | Wilcoxon | 1 |
| 5 | Exposure | BriTROC | PCAWG | 0.0005020 | 0.0035137 | 0.0005 | *** | Wilcoxon | 1 |
ggplot(pdat,aes(x=Signature,y=Exposure,fill=Cohort))+geom_boxplot(alpha=0.3)+
my_theme+scale_fill_manual(values = cbPalette)+coord_cartesian(ylim=c(0,1))+
annotate("text",x = signif$Signature,y=signif$Exposure,label=signif$p.signif)
This section describes how the survival intervals in the Britroc-1 cohort were calculated from raw data files generated from the Britroc clinical trial database. The following raw data files required for the calculation can be obtained by writing to the Data Access Committee:
1. reg.csv - Registration Information
2. bas.csv - Baseline Information
3. flchm.csv - First Line Chemotherapy Information
4. flrsp.csv - First Line Response Information
5. dnf.csv - Death Notification Information
6. biop.csv - Image Guided Biopsy Information
Records in the flchm.csv file were filtered to retain only patients that had received platinum-based therapy as first line treatment. In instances where only month and year were available for certain time points the date was assumed to be the 1st of the month to enable calculations.
# run this code if processing from raw data files placed in the 'restricted_data/' folder
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(zoo))
filepath <-"restricted_data/"
filenames <- c("flchm", "flrsp", "bas", "dnf", "reg", "biop")
for(fname in filenames){
assign(fname,read.csv(paste0(filepath,fname,".csv"),na.strings=c(""," ","NA"), as.is = T))
}
# subset flchm to platinum-based therapy only
flchm <- flchm %>%
filter(grepl("PLAT", DRUG1)) %>%
select(-DRUG1)
# use full_join to keep missing information across datasets, combine dataframes by TRIALNO and retain only columns necessary for subsequent analysis
tokeep <- c("STUDYNO", "TRIALNO", "STAGE", "STGABC" , "DIAGDATE", "DIAGDATE_MONTH", "DIAGDATE_YEAR", "DTPROG1", "DTPROG1_MONTH","DTPROG1_YEAR", "DOD", "DTALIVE", "REGDATE", "AGE", "BIOPDT")
merged <- full_join(bas,flchm ,by="TRIALNO") %>%
full_join(flrsp,by="TRIALNO") %>%
full_join(dnf,by="TRIALNO") %>%
full_join(reg,by="TRIALNO") %>%
select(one_of(tokeep)) %>%
distinct
# format columns containing date information into date type
datecols <- c("DIAGDATE", "DTPROG1", "DOD", "DTALIVE", "REGDATE")
merged[datecols] = lapply(merged[datecols], as.Date, "%m/%d/%Y")
# fill in missing DIAGDATE and DTPROG1 values as 1st of the month where only month and year info is available
idx <- which(is.na(merged$DIAGDATE) & !is.na(merged$DIAGDATE_MONTH) & !is.na(merged$DIAGDATE_YEAR))
dates <- paste0(merged$DIAGDATE_MONTH[idx],merged$DIAGDATE_YEAR[idx])
merged$DIAGDATE[idx] <- as.Date(as.yearmon(dates, format = "%b%Y"))
idx.dtprog1 <- which(is.na(merged$DTPROG1) & !is.na(merged$DTPROG1_MONTH)& !is.na(merged$DTPROG1_YEAR))
dates.dtprog1 <- paste0(merged$DTPROG1_MONTH[idx.dtprog1],merged$DTPROG1_YEAR[idx.dtprog1])
merged$DTPROG1[idx.dtprog1] <- as.Date(as.yearmon(dates.dtprog1, format = "%b%Y"))
# add last documented clinical assessment (DTLAST) date and survival status (STATUS) columns as of 1 December 2016
# DTLAST: DTALIVE or DOD
# STATUS: a death event was recorded as 1 and survival as 0
merged$DTLAST <- merged$DTALIVE
survidx <- which(!is.na(merged$DOD))
merged$DTLAST[survidx] <- merged$DOD[survidx]
merged$STATUS <- 1
merged$STATUS[survidx] <- 0
# overall survival in BriTROC-1 patients was calculated from the date of enrolment to the date of death or the last documented clinical assessment, with data cutoff at 1 December 2016.
merged$PFS <- difftime(merged$DTPROG1, merged$DIAGDATE)
merged$OS <- difftime(merged$DTLAST, merged$DIAGDATE)
merged$INT_START <- difftime(merged$REGDATE,merged$DIAGDATE)
merged$INT_END <- difftime(merged$DTLAST,merged$DIAGDATE)
# remove erroneous rows with negative end dates
merged <- filter(merged,INT_START < INT_END)
out <- merged %>%
select(TRIALNO, AGE, STATUS, PFS, OS, INT_START, INT_END)
write.table(out, file = "data/britroc_survival_intervals.tsv", sep = "\t", quote = F, row.names = F)
Given that the BritROC study only enrolled patients with relapsed disease, it was necessary to left truncate the overall survival time. In addition, cases where the patient was not deceased were right censored. We combined the BritROC, PCAWG and TCGA cases and fit a Cox proportinal hazards model. We stratifed by study (BritROC, OV-US, OV-AU and TCGA).
library(survival)
set.seed(seed)
samp_num<-0.7 #fraction of the data considered training
tcga_clin<-read.table("data/tcga_sample_info.tsv",stringsAsFactors = F,header=T,sep="\t")
pcawg_clin<-read.table("data/pcawg_sample_info.tsv",sep="\t",header=T,stringsAsFactors = F)
pcawg_cohort<-read.table("data/pcawg_cohort_info.tsv",header=T,sep="\t",stringsAsFactors = F)
#britroc survival data
pat_info<-samp_annotation[,c("Britroc_No","IM.JBLAB_ID","star_rating")]
pat_info<-samp_annotation[grepl("IM",samp_annotation$IM.JBLAB_ID),]
pat_info<-pat_info[order(pat_info$Britroc_No,pat_info$star_rating,decreasing = T),]
pat_info<-pat_info[!duplicated(pat_info$Britroc_No),]
pat_info<-pat_info[!pat_info$IM.JBLAB_ID%in%c("IM_91","IM_70"),]#remove misclassified samples
surv_dat<-read.table("data/britroc_survival_intervals.tsv",sep="\t",header=T,stringsAsFactors = F)
surv_dat<-surv_dat[!duplicated(surv_dat$TRIALNO),]
britroc_surv_dat<-merge(surv_dat,pat_info,by.x=1,by.y=1)
rownames(britroc_surv_dat)<-britroc_surv_dat$IM.JBLAB_ID
britroc_surv_dat<-britroc_surv_dat[colnames(sig_pat_mat_britroc_all)[colnames(sig_pat_mat_britroc_all)%in%rownames(britroc_surv_dat)],]
britroc_surv_dat<-britroc_surv_dat[!(is.na(britroc_surv_dat$INT_START)|is.na(britroc_surv_dat$INT_END)),]
britroc_surv_dat<-britroc_surv_dat[!(britroc_surv_dat$INT_START>britroc_surv_dat$INT_END),]
britroc_train<-rownames(britroc_surv_dat)
britroc_test<-rownames(britroc_surv_dat)[!rownames(britroc_surv_dat)%in%britroc_train]
#pcawg survival data
pdat<-pcawg_cohort[pcawg_cohort$tumor_wgs_aliquot_id%in%colnames(sig_pat_mat_pcawg),c("tumor_wgs_aliquot_id","icgc_donor_id","dcc_project_code")]
pdat<-merge(pdat,pcawg_clin[,c("icgc_donor_id","donor_vital_status","donor_survival_time","donor_interval_of_last_followup","donor_age_at_diagnosis")],by.x=2,by.y=1)
pdat<-pdat[!duplicated(pdat$tumor_wgs_aliquot_id),]
rownames(pdat)<-pdat$tumor_wgs_aliquot_id
pdat<-pdat[colnames(sig_pat_mat_pcawg),]
pcawg_surv_dat<-pdat
pcawg_os<-pcawg_surv_dat$donor_survival_time
pcawg_os[is.na(pcawg_os)]<-pcawg_surv_dat$donor_interval_of_last_followup[is.na(pcawg_os)]
pcawg_surv_dat$os<-pcawg_os
pcawg_event<-pcawg_surv_dat$donor_vital_status
pcawg_event<-plyr::revalue(pcawg_event,c("deceased"=1,"alive"=0))
pcawg_surv_dat$event<-pcawg_event
pcawg_train<-sample(rownames(pcawg_surv_dat),round(samp_num*nrow(pcawg_surv_dat)))
pcawg_test<-rownames(pcawg_surv_dat)[!rownames(pcawg_surv_dat)%in%pcawg_train]
#tcga survival data
tcga_surv_dat<-tcga_clin
tcga_surv_dat<-tcga_surv_dat[!is.na(tcga_surv_dat$age_at_initial_pathologic_diagnosis),]
tcga_surv_dat<-tcga_surv_dat[tcga_surv_dat$bcr_aliquot_barcode%in%colnames(sig_pat_mat_tcga),]
tcga_surv_dat<-tcga_surv_dat[!is.na(tcga_surv_dat$days_to_last_followup),]
tcga_survival_sigmat<-sig_pat_mat_tcga
tcga_os<-tcga_surv_dat$days_to_death
tcga_status<-!is.na(tcga_os)
tcga_surv_dat$status<-tcga_status
tcga_os[is.na(tcga_os)]<-tcga_surv_dat$days_to_last_followup[is.na(tcga_os)]
tcga_surv_dat$os<-tcga_os
tcga_train<-sample(rownames(tcga_surv_dat),round(samp_num*nrow(tcga_surv_dat)))
tcga_test<-rownames(tcga_surv_dat)[!rownames(tcga_surv_dat)%in%tcga_train]
combined_sig_pat_mat<-cbind(sig_pat_mat_britroc_all[,rownames(britroc_surv_dat)],
sig_pat_mat_pcawg[,rownames(pcawg_surv_dat)],
sig_pat_mat_tcga[,tcga_surv_dat$bcr_aliquot_barcode])
#threshold low exposure values to 2%
combined_sig_pat_mat[combined_sig_pat_mat<=0.02]<-0.02
combined_os_survival<-survival::Surv(time= c(britroc_surv_dat$INT_START,
rep(1,length(pcawg_os)),
rep(1,length(tcga_os))),
time2=c(britroc_surv_dat$INT_END,
pcawg_surv_dat$os,
tcga_surv_dat$os),
event=c(!britroc_surv_dat$STATUS,
as.numeric(pcawg_surv_dat$event)==1,
tcga_surv_dat$status))
survival_ids<-c(rownames(britroc_surv_dat),rownames(pcawg_surv_dat),rownames(tcga_surv_dat))
sig_data<-data.frame(t(combined_sig_pat_mat),
study=c(rep("britroc",nrow(britroc_surv_dat)),
pcawg_surv_dat$dcc_project_code,rep("TCGA",nrow(tcga_surv_dat))),
age=c(britroc_surv_dat$AGE,
pcawg_surv_dat$donor_age_at_diagnosis,
tcga_surv_dat$age_at_initial_pathologic_diagnosis),
stringsAsFactors = F)
#stratify by age
sig_data$age.cat<-car::recode(as.numeric(sig_data$age), "lo:39=1; 40:44=2; 45:49=3; 50:54=4; 55:59=5; 60:64=6; 65:69=7; 70:74=8; 75:79=9; 80:hi=10")
As the signatures sum to 1, in order to perform survival analysis the exposures had to be normalised using the signature with the lowest variance:
#determine signature with lowest variance
which.min(apply(combined_sig_pat_mat,1,var))
## s5
## 5
#normalise signature exposures
sig_data_unorm<-sig_data
sig_data[,1:nsig]<-apply(sig_data[,1:nsig],2,function(x){log2(mapply("/",x,sig_data[,5]))})
#split data into training and test cohorts
sig_data_train<-sig_data[survival_ids%in%c(britroc_train,pcawg_train,tcga_train),]
sig_data_test<-sig_data[survival_ids%in%c(britroc_test,pcawg_test,tcga_test),]
combined_os_survival_train<-combined_os_survival[survival_ids%in%c(britroc_train,pcawg_train,tcga_train)]
combined_os_survival_test<-combined_os_survival[survival_ids%in%c(britroc_test,pcawg_test,tcga_test)]
Here we fit a cox proportional hazards model to predict survival with the normlised signature exposures as covariates, stratified by study and age.
os_coxph<-coxph(combined_os_survival_train ~ s1+s2+s3+s4+s6+s7 + strata(study, age.cat,na.group=T), data = sig_data_train)
summary(os_coxph)
## Call:
## coxph(formula = combined_os_survival_train ~ s1 + s2 + s3 + s4 +
## s6 + s7 + strata(study, age.cat, na.group = T), data = sig_data_train)
##
## n= 417, number of events= 241
##
## coef exp(coef) se(coef) z Pr(>|z|)
## s1 0.167312 1.182123 0.050049 3.343 0.000829 ***
## s2 0.172932 1.188786 0.064564 2.678 0.007396 **
## s3 -0.139257 0.870004 0.059735 -2.331 0.019740 *
## s4 -0.017087 0.983058 0.056725 -0.301 0.763236
## s6 0.001405 1.001406 0.051779 0.027 0.978353
## s7 -0.174542 0.839842 0.053591 -3.257 0.001126 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## s1 1.1821 0.8459 1.0717 1.3040
## s2 1.1888 0.8412 1.0475 1.3492
## s3 0.8700 1.1494 0.7739 0.9781
## s4 0.9831 1.0172 0.8796 1.0987
## s6 1.0014 0.9986 0.9048 1.1084
## s7 0.8398 1.1907 0.7561 0.9329
##
## Concordance= 0.567 (se = 0.092 )
## Rsquare= 0.048 (max possible= 0.881 )
## Likelihood ratio test= 20.58 on 6 df, p=0.002179
## Wald test = 20.53 on 6 df, p=0.002229
## Score (logrank) test = 20.95 on 6 df, p=0.001869
#test for proportional hazards
cox.zph(os_coxph)
## rho chisq p
## s1 0.0380 0.282 0.5952
## s2 0.1482 5.765 0.0163
## s3 -0.1094 2.659 0.1030
## s4 -0.0235 0.129 0.7194
## s6 -0.0356 0.292 0.5890
## s7 -0.0453 0.432 0.5112
## GLOBAL NA 6.980 0.3227
Note: exp(coef) is the Hazard ratio.
Here we tested the ability of the cox proportional hazards model to predict overall survival using the concordance index.
#predict survival using coxph model
train_predict<-predict(os_coxph,type="risk")
test_predict<-predict(os_coxph,newdata=sig_data_test,type="risk")
perf <- survcomp::concordance.index(x = test_predict, surv.time = combined_os_survival_test[,"stop"],
surv.event = combined_os_survival_test[, "status"], method = "noether",na.rm = TRUE)
perf[1:5]
## $c.index
## [1] 0.5597025
##
## $se
## [1] 0.03012423
##
## $lower
## [1] 0.5006601
##
## $upper
## [1] 0.6187449
##
## $p.value
## [1] 0.04749321
We then fit a second cox proportional hazards model to the combined test and training data.
os_coxph<-coxph(combined_os_survival~ s1+s2+s3+s4+s6+s7 + strata(study, age.cat,na.group=T), data = sig_data)
summary(os_coxph)
## Call:
## coxph(formula = combined_os_survival ~ s1 + s2 + s3 + s4 + s6 +
## s7 + strata(study, age.cat, na.group = T), data = sig_data)
##
## n= 575, number of events= 335
##
## coef exp(coef) se(coef) z Pr(>|z|)
## s1 0.138814 1.148910 0.041584 3.338 0.000843 ***
## s2 0.109738 1.115986 0.050336 2.180 0.029250 *
## s3 -0.094791 0.909563 0.048213 -1.966 0.049289 *
## s4 -0.001757 0.998244 0.047431 -0.037 0.970445
## s6 0.014342 1.014446 0.042268 0.339 0.734369
## s7 -0.122635 0.884586 0.044438 -2.760 0.005786 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## s1 1.1489 0.8704 1.0590 1.2465
## s2 1.1160 0.8961 1.0111 1.2317
## s3 0.9096 1.0994 0.8275 0.9997
## s4 0.9982 1.0018 0.9096 1.0955
## s6 1.0144 0.9858 0.9338 1.1021
## s7 0.8846 1.1305 0.8108 0.9651
##
## Concordance= 0.544 (se = 0.072 )
## Rsquare= 0.036 (max possible= 0.928 )
## Likelihood ratio test= 20.87 on 6 df, p=0.001933
## Wald test = 20.72 on 6 df, p=0.002062
## Score (logrank) test = 20.99 on 6 df, p=0.001841
#check for non-proportional hazards
cox.zph(os_coxph)
## rho chisq p
## s1 0.00991 0.0302 0.8620
## s2 0.10822 4.4903 0.0341
## s3 -0.06856 1.4517 0.2283
## s4 0.02384 0.1926 0.6607
## s6 -0.02723 0.2470 0.6192
## s7 -0.04619 0.6618 0.4159
## GLOBAL NA 6.7374 0.3458
#Compute hazard ratio table
predicted_survival<-predict(os_coxph,type="risk")
HR <- round(exp(coef(os_coxph)), 2)
CI <- round(exp(confint(os_coxph)), 2)
P <- round(coef(summary(os_coxph))[,5], 3)
colnames(CI) <- c("Lower", "Higher")
hazard_table <- as.data.frame(cbind(HR, CI, P))
hazard_table<-cbind(Signature=row.names(hazard_table),hazard_table)
knitr::kable(hazard_table)
| Signature | HR | Lower | Higher | P | |
|---|---|---|---|---|---|
| s1 | s1 | 1.15 | 1.06 | 1.25 | 0.001 |
| s2 | s2 | 1.12 | 1.01 | 1.23 | 0.029 |
| s3 | s3 | 0.91 | 0.83 | 1.00 | 0.049 |
| s4 | s4 | 1.00 | 0.91 | 1.10 | 0.970 |
| s6 | s6 | 1.01 | 0.93 | 1.10 | 0.734 |
| s7 | s7 | 0.88 | 0.81 | 0.97 | 0.006 |
To group patients based on their copy-number signature exposures we used the NBClust package to determine the number of clusters present in the data and cluster the patients.
mydata<-t(combined_sig_pat_mat[,])
#determine number of clusters
fit <- NbClust::NbClust(mydata,method="ward.D2")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 1 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 11 as the best number of clusters
## * 2 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
#visualise signature exposures within clusters
sig_data$clust=fit$Best.partition
pdat<-data.frame(t(combined_sig_pat_mat))
pdat$clust<-fit$Best.partition
pdat$sample<-rownames(pdat)
pdat<-reshape2::melt(pdat,id.vars=c(8,9))
pdat$variable<-plyr::revalue(pdat$variable,
c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
ggplot(pdat,aes(x=variable,y=value))+geom_boxplot()+geom_jitter(alpha=0.1)+facet_grid(clust ~ .)+my_theme+xlab("Signature")+ylab("Exposure")
pdat$sample<-factor(pdat$sample,levels=names(predicted_survival)[order(predicted_survival,decreasing=T)])
pdat$clust<-factor(pdat$clust,levels=3:1)
clust_plot<-ggplot(pdat,aes(y=clust,x=sample))+geom_tile()+my_theme+theme(axis.text.x =element_blank(),axis.ticks.x = element_blank())+
xlab("")+ylab("")+scale_fill_discrete()+theme(legend.position = "none")
clust_plot
We also fit a cox-proportional hazards model using the clusters as covariates.
sig_data$clust<-factor(sig_data$clust,levels=3:1)
os_coxph<-coxph(combined_os_survival~ clust + strata(study, age.cat), data = sig_data)
summary(os_coxph)
## Call:
## coxph(formula = combined_os_survival ~ clust + strata(study,
## age.cat), data = sig_data)
##
## n= 575, number of events= 335
##
## coef exp(coef) se(coef) z Pr(>|z|)
## clust2 0.53158 1.70161 0.17598 3.021 0.00252 **
## clust1 0.05604 1.05764 0.14193 0.395 0.69296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## clust2 1.702 0.5877 1.2052 2.402
## clust1 1.058 0.9455 0.8008 1.397
##
## Concordance= 0.518 (se = 0.063 )
## Rsquare= 0.017 (max possible= 0.928 )
## Likelihood ratio test= 10.04 on 2 df, p=0.006602
## Wald test = 10.71 on 2 df, p=0.00473
## Score (logrank) test = 10.89 on 2 df, p=0.004324
cox.zph(os_coxph)
## rho chisq p
## clust2 0.0614 1.209 0.271
## clust1 0.0414 0.548 0.459
## GLOBAL NA 1.261 0.532
While this model showed significant predictive preformance, the exposures across the clusters (above) reveal that cluster 2 is based primarily on signal from signature 1 exposures, and thus a cluster based model does not add information above and beyond the signature model.
We combined all three cohorts to test if mutant carriers had significantly higher signature exposures compared to wild-type carriers for known cancer causing genes.
library(VariantAnnotation)
#this chunk of code is not run be default due to data restrictions. Please download the required files (see below) to execute this segment of code.
all_mutations<-c()
#TCGA cohort
#download Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0 from Broad GDAC https://gdac.broadinstitute.org/
tcga_ids<-colnames(sig_pat_mat_tcga)
tcga_ids_missing<-c()
for(i in tcga_ids)
{
fn<-paste0("restricted_data/gdac.broadinstitute.org_OV.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0/",
substr(i,1,15),".maf.txt")
if(file.exists(fn))
{
curr_mut<-read.table(fn,sep="\t",header=T,stringsAsFactors = F)
all_mutations<-rbind(all_mutations,cbind(i,curr_mut$Chromosome,curr_mut$Start_position,as.character(curr_mut$Reference_Allele),as.character(curr_mut$Tumor_Seq_Allele2)))
}else{
tcga_ids_missing<-c(tcga_ids_missing,i)
}
}
colnames(all_mutations)<-c("sample","chromosome","position","ref","alt")
tcga_mutations<-all_mutations
#PCAWG cohort
#download the final_consensus_12oct_passonly_snv_indel.tar variant calls from PCAWG then run the following commands:
#grep -v -E 'IGR|Silent|RNA|lincRNA|Intron|UTR|Flank|Splice' October_2016_ovary.snv_indel.maf > October_2016_ovary_deleterious.snv_indel.maf
#grep -v IGR October_2016_ovary.snv_indel.maf | cut -f13,2,3,8,10 - | awk '{print $5 "\t" $1 "\t" $2 "\t" $3 "\t" $4 }' - > October_2016_ovary.snv_indel.CGIformat
pcawg_ids<-colnames(sig_pat_mat_pcawg)
pcawg_mut_table<-read.table("restricted_data/October_2016_ovary.snv_indel.CGIformat",header=T,sep=",",stringsAsFactors=F)
colnames(pcawg_mut_table)<-c("sample","chromosome","position","ref","alt")
pcawg_ids_missing<-pcawg_ids[!pcawg_ids%in%pcawg_mut_table$sample]
pcawg_mut_table<-pcawg_mut_table[pcawg_mut_table$sample%in%pcawg_ids,]
pcawg_mut_table<-as.matrix(pcawg_mut_table)
all_mutations<-rbind(all_mutations,pcawg_mut_table)
pcawg_mutations<-pcawg_mut_table
#BritROC cohort
#download VCFs from xxxxxxx and put in restricted_data/SNVs directory
isDeleterious<-function(x){
grepl(paste("3_prime_UTR_variant",
"5_prime_UTR_premature_start_codon_gain_variant",
"5_prime_UTR_variant",
"downstream_gene_variant",
"initiator_codon_variant",
#"intergenic_region",
"intragenic_variant",
"intron_variant",
"missense_variant",
"missense_variant&splice_region_variant",
"non_coding_transcript_exon_variant",
"splice_acceptor_variant&intron_variant",
"splice_donor_variant&intron_variant",
"splice_region_variant",
"splice_region_variant&intron_variant",
"splice_region_variant&non_coding_transcript_exon_variant",
"splice_region_variant&synonymous_variant",
"start_lost",
"start_lost&splice_region_variant",
"stop_gained",
"stop_gained&splice_region_variant",
"stop_lost",
"stop_lost&splice_region_variant",
"stop_retained_variant",
"synonymous_variant",
"upstream_gene_variant",
sep="|"),x)}
isPASS<-function(x){
grepl("PASS",x,fixed=TRUE)}
prefilters <- FilterRules(list(PASS=isPASS,deleterious=isDeleterious))
britroc_mutations<-c()
library(VariantAnnotation)
for(i in 1:nrow(britroc_ids))
{
study_id<-as.character(britroc_ids[i,1])
jblab_id<-as.character(britroc_ids[i,2])
filename<-paste0("restricted_data/SNVs/",study_id,".snv.filters.blacklist.vcf.gz")
if(file.exists(filename))
{
tabix.file <- TabixFile(filename, yieldSize=10000)
destination.file<-paste0("restricted_data/SNVs/",study_id,".snv.deleterious.vcf")
if(!file.exists(destination.file))
{
filterVcf(tabix.file,"hg19", destination.file, prefilters=prefilters, verbose=TRUE)
}
curr_vcf<-readVcf(destination.file)
britroc_mutations<-rbind(britroc_mutations,cbind(jblab_id,as.character(seqnames(curr_vcf)),
as.character(start(curr_vcf)),
as.character(ref(curr_vcf)),
as.character(unlist(alt(curr_vcf)))))
}
}
all_mutations<-rbind(all_mutations,britroc_mutations)
all_mutations[,2]<-paste0("chr",all_mutations[,2])
colnames(all_mutations)<-c("sample","chr","pos","ref","alt")
all_mutations[,3]<-as.numeric(all_mutations[,3])
#calculate loss and amplification for CGI
CGI_cancer_genes<-read.table("data/cancer_genes_upon_mutations_or_CNAs.tsv",sep="\t",stringsAsFactors = F,header=T)
CGI_cancer_genes<-unique(CGI_cancer_genes$gene)
#get gene TSS
library(org.Hs.eg.db)
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
loc<-TxDb.Hsapiens.UCSC.hg19.knownGene
CGI_genemap<-select(org.Hs.eg.db, CGI_cancer_genes, c("ENTREZID","GENENAME"), "ALIAS")
gene_locs<-select(loc,CGI_genemap$ENTREZID,c("GENEID","TXCHROM","TXSTART","TXEND"),"GENEID")
gene_locs<-gene_locs[!duplicated(gene_locs$GENEID),]
gene_coords<-c()
for(i in CGI_cancer_genes)
{
currg<-CGI_genemap[CGI_genemap$ALIAS==i,"ENTREZID"]
if(length(currg)==1)
{
curr_loc<-gene_locs[gene_locs$GENEID==currg,]
}
else
{
for(j in currg)
{
curr_loc<-gene_locs[gene_locs$GENEID==j,]
if(!is.na(curr_loc[1,2]))
{
break
}
}
}
if(!is.null(curr_loc))
{
chr<-curr_loc[1,"TXCHROM"]
tss<-abs(curr_loc[1,"TXSTART"])
tss_end<-abs(curr_loc[1,"TXEND"])
gene_coords<-rbind(gene_coords,as.character(c(i,chr,tss,tss_end)))
}
}
colnames(gene_coords)<-c("gene","chr","tss","tss_end")
gene_coords<-data.frame(gene_coords,stringsAsFactors = F)
gene_coords$chr<-substring(gene_coords$chr,4)
gene_coords<-gene_coords[gene_coords$chr%in%c(1:22),]
britroc_segTab<-c()
britroc_segTab_remain<-c()
for(i in 1:ncol(hq_CN))
{
britroc_segTab[[colnames(hq_CN[,i])]]<-getSegTable(hq_CN[,i])
}
for(i in 1:ncol(lq_CN))
{
britroc_segTab[[colnames(lq_CN[,i])]]<-getSegTable(lq_CN[,i])
}
for(i in 1:ncol(remain_CN))
{
britroc_segTab_remain[[colnames(remain_CN[,i])]]<-getSegTable(remain_CN[,i])
}
all_segTabs<-c(britroc_segTab,pcawg_segTabs,tcga_segTabs,britroc_segTab_remain)
cn_mat<-c()
for(j in 1:nrow(gene_coords))
{
res<-c()
name<-gene_coords[j,"gene"]
chr<-gene_coords[j,"chr"]
tss<-abs(as.numeric(gene_coords[j,"tss"]))
for(i in colnames(all_sig))
{
segTable<-all_segTabs[[i]]
cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<tss&as.numeric(segTable[,3])>tss,4])
if(!length(cn)==1)
{
modtss<-tss+1000000
cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<modtss&as.numeric(segTable[,3])>modtss,4])
if(!length(cn)==1)
{
cn<-NA
}
}
res<-c(res,cn)
}
cn_mat<-rbind(cn_mat,c(name,res))
}
colnames(cn_mat)<-c("Gene",colnames(all_sig))
rownames(cn_mat)<-cn_mat[,1]
cn_mat<-cn_mat[,-1]
del<-apply(cn_mat,1,function(x){
x<-x[!is.na(x)]
names(x[as.numeric(x)<0.4])})
del<-del[unlist(lapply(del,length))>0]
amp<-apply(cn_mat,1,function(x){
x<-x[!is.na(x)]
names(x[as.numeric(x)>5])})
amp<-amp[unlist(lapply(amp,length))>0]
cn_forCGI<-c()
for(i in names(del))
{
cn_forCGI<-rbind(cn_forCGI,cbind(del[[i]],i,"del"))
}
for(i in names(amp))
{
cn_forCGI<-rbind(cn_forCGI,cbind(amp[[i]],i,"amp"))
}
colnames(cn_forCGI)<-c("sample","gene","cna")
write.table(cn_forCGI,"cn_for_CGI.txt",quote=F,row.names=F,sep="\t")
gene_coords_gr<-gene_coords
colnames(gene_coords_gr)<-c("gene","chr","start","end")
gene_coords_gr$chr<-paste0("chr",gene_coords_gr$chr)
gene_coords_gr<-makeGRangesFromDataFrame(gene_coords_gr,seqinfo = paste0("chr",c(1:22,"X","Y")))
colnames(tcga_mutations)<-colnames(all_mutations)
library(rtracklayer)
tcga_mutations<-data.frame(tcga_mutations)
coords_to_liftover<-data.frame(chr=paste0("chr",tcga_mutations$chr),start=tcga_mutations$pos,end=tcga_mutations$pos,tcga_mutations[,c(1,4,5)])
gr<-makeGRangesFromDataFrame(coords_to_liftover,keep.extra.columns = TRUE,ignore.strand = TRUE)
chain<-import.chain("data/hg18ToHg19.over.chain")
gr_hg19<-liftOver(gr,chain)
gr_hg19<-subsetByOverlaps(unlist(gr_hg19),gene_coords_gr)
tcga_mutations_hg19<-data.frame(gr_hg19)[,c("sample","seqnames","start","ref","alt")]
colnames(tcga_mutations_hg19)<-c("sample","chr","pos","ref","alt")
tcga_mutations_hg19$chr<-substring(tcga_mutations_hg19$chr,4)
write.table(tcga_mutations_hg19,"tcga_mutations_for_CGI.txt",quote=F,row.names=F,sep="\t")
britroc_mutations<-data.frame(britroc_mutations)
colnames(britroc_mutations)<-colnames(all_mutations)
britroc_mutations_gr<-data.frame(chr=paste0("chr",britroc_mutations$chr),start=britroc_mutations$pos,end=britroc_mutations$pos,britroc_mutations[,c(1,4,5)])
britroc_mutations_gr<-makeGRangesFromDataFrame(britroc_mutations_gr,keep.extra.columns = TRUE,ignore.strand = TRUE)
britroc_mutations_gr<-subsetByOverlaps(unlist(britroc_mutations_gr),gene_coords_gr)
britroc_mutations<-data.frame(britroc_mutations_gr)[,c("sample","seqnames","start","ref","alt")]
colnames(britroc_mutations)<-c("sample","chr","pos","ref","alt")
britroc_mutations$chr<-substring(britroc_mutations$chr,4)
write.table(britroc_mutations,"britroc_mutations_for_CGI.txt",quote=F,row.names=F,sep="\t")
pcawg_mutations<-data.frame(pcawg_mutations)
colnames(pcawg_mutations)<-colnames(all_mutations)
pcawg_mutations_gr<-data.frame(chr=paste0("chr",pcawg_mutations$chr),start=pcawg_mutations$pos,end=pcawg_mutations$pos,pcawg_mutations[,c(1,4,5)])
pcawg_mutations_gr<-makeGRangesFromDataFrame(pcawg_mutations_gr,keep.extra.columns = TRUE,ignore.strand = TRUE)
pcawg_mutations_gr<-subsetByOverlaps(unlist(pcawg_mutations_gr),gene_coords_gr)
pcawg_mutations<-data.frame(pcawg_mutations_gr)[,c("sample","seqnames","start","ref","alt")]
colnames(pcawg_mutations)<-c("sample","chr","pos","ref","alt")
pcawg_mutations$chr<-substring(pcawg_mutations$chr,4)
write.table(pcawg_mutations,"pcawg_mutations_for_CGI.txt",quote=F,row.names=F,sep="\t")
To run the Cancer Genome Interpreter analysis use the python script launch_CGI.py located in the scripts directory.
#after the above analysis is complete, results should be downloaded into the following directories:
#PCAWG:restricted_data/pcawg/
#TCGA:restricted_data/tcga/
#BRITROC:restricted_data/britroc/
#CNAs:resticted_data/cnas
CGImuts_pcawg<-read.table("restricted_data/pcawg/mutation_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGImuts_tcga<-read.table("restricted_data/tcga/mutation_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGImuts_britroc<-read.table("restricted_data/britroc/mutation_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGImuts<-rbind(CGImuts_britroc,CGImuts_pcawg,CGImuts_tcga)
CGImuts<-CGImuts[CGImuts$driver_mut_prediction%in%c("TIER 1","TIER 2"),]
write.table(CGImuts,"mutated_cases.tsv",row.names = F,sep="\t")
CGIcna<-read.table("restricted_data/cnas/cna_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGIcna<-CGIcna[!CGIcna$driver_statement%in%c("predicted passenger"),]
write.table(CGIcna,"del_amp_cases.tsv",row.names = F,sep="\t")
CGIcna<-CGIcna[CGIcna$sample%in%all_mutations[,1],]
all_mut<-list()
for(i in unique(CGImuts$sample))
{
all_mut[[i]]<-CGImuts[CGImuts$sample==i,"gene"]
}
for(i in unique(CGIcna$sample))
{
if(is.null(all_mut[[i]]))
{
all_mut[[i]]<-CGIcna[CGIcna$sample==i,"gene"]
}
else
{
all_mut[[i]]<-c(all_mut[[i]],CGIcna[CGIcna$sample==i,"gene"])
}
}
all_mut<-all_mut[names(all_mut)%in%colnames(all_sig)]
#contruct mutation matrix
all_genes<-unique(as.character(unlist(all_mut)))
all_genes<-all_genes[!is.na(all_genes)]
library(ReactomePA)
library(org.Hs.eg.db)
genemap<-AnnotationDbi::select(org.Hs.eg.db, all_genes, c("ENTREZID","GENENAME"), "ALIAS")
genemap<-genemap[!duplicated(genemap$ALIAS),]
x <- enrichPathway(gene=genemap$ENTREZID,pvalueCutoff=0.05,qvalueCutoff=0.05,readable = T)
enriched_pathways<-x
x<-as.data.frame(x)
x<-x[x$Count>=5,]
pathway_genes<-list()
for(i in 1:nrow(x))
{
gene_ratio<-as.numeric(unlist(strsplit(x[i,]$GeneRatio,"/")))
bg_ratio<-as.numeric(unlist(strsplit(x[i,]$BgRatio,"/")))
if((gene_ratio[1]/bg_ratio[1])>0.05)
{
pathway_genes[[x[i,]$ID]]<-unlist(strsplit(x[i,]$geneID,"/"))
}
}
all_genes<-all_genes[all_genes%in%unique(unlist(pathway_genes))]
all_genes<-all_genes[!all_genes=="TP53"]
mut_matrix<-c()
for(g in all_genes)
{
mut_matrix<-cbind(mut_matrix,unlist(lapply(all_mut,function(x){sum(x%in%g)>0})))
}
colnames(mut_matrix)<-all_genes
mut_matrix<-mut_matrix[,colSums(mut_matrix)>=1]
mut_matrix<-data.frame(mut_matrix)
all_sequenced_patients<-unique(all_mutations[,1])
samples_to_add<-all_sequenced_patients[!all_sequenced_patients%in%rownames(mut_matrix)]
unmut<-rep(FALSE,length(samples_to_add)*ncol(mut_matrix))
dim(unmut)<-c(length(samples_to_add),ncol(mut_matrix))
rownames(unmut)<-samples_to_add
colnames(unmut)<-colnames(mut_matrix)
mut_matrix<-rbind(mut_matrix,unmut)
mut_matrix<-mut_matrix[rownames(mut_matrix)%in%colnames(all_sig),]
#request the following files from the DACO for this project: HRstatus.csv and somaticHRD.csv
HRmuts<-read.table("restricted_data/HRstatus.csv",header=T,stringsAsFactors = F,sep=",")
HRmuts$Gene[is.na(HRmuts$Gene)]<-"WT"
BRCA1_mutants<-britroc_ids[britroc_ids[,1]%in%HRmuts$BritRoc.No[HRmuts$Gene=="BRCA1"],2]
BRCA2_mutants<-britroc_ids[britroc_ids[,1]%in%HRmuts$BritRoc.No[HRmuts$Gene=="BRCA2"],2]
somaticBRCA<-read.table("restricted_data/somaticHRD.csv",header=T,stringsAsFactors = F,sep=",")
BRCA1_mutants<-c(BRCA1_mutants,britroc_ids[britroc_ids[,2]%in%somaticBRCA[somaticBRCA$Symbol..Gene.ID.=="BRCA1","Sample"],2])
BRCA2_mutants<-c(BRCA2_mutants,britroc_ids[britroc_ids[,2]%in%somaticBRCA[somaticBRCA$Symbol..Gene.ID.=="BRCA2","Sample"],2])
mut_matrix[rownames(mut_matrix)%in%BRCA1_mutants,"BRCA1"]<-TRUE
mut_matrix[rownames(mut_matrix)%in%BRCA2_mutants,"BRCA2"]<-TRUE
#get germline BRCA mutations for TCGA
BRCA1<-c("TCGA-10-0931",
"TCGA-13-1408","TCGA-23-102","TCGA-23-1118","TCGA-23-2078","TCGA-23-2079","TCGA-13-0887",
"TCGA-13-1494","TCGA-13-0893","TCGA-13-0903","TCGA-61-2109","TCGA-04-1356","TCGA-59-2348",
"TCGA-13-1512","TCGA-09-1669","TCGA-25-2392","TCGA-24-2298","TCGA-24-1470","TCGA-57-1582",
"TCGA-09-2051","TCGA-13-0883","TCGA-23-1122","TCGA-23-2077","TCGA-23-2081","TCGA-25-2401",
"TCGA-09-2045","TCGA-61-2008")
BRCA2<-c("TCGA-24-0975","TCGA-24-2288","TCGA-13-0900","TCGA-04-1367","TCGA-25-2404","TCGA-24-1463",
"TCGA-24-1417","TCGA-24-2024","TCGA-04-1336","TCGA-13-0913","TCGA-13-0886","TCGA-13-1498",
"TCGA-13-1499","TCGA-24-2280","TCGA-59-2351","TCGA-13-0726","TCGA-24-2293","TCGA-24-1562",
"TCGA-13-1512","TCGA-23-1026")
mut_matrix[substring(rownames(mut_matrix),1,12)%in%BRCA1,"BRCA1"]<-TRUE
mut_matrix[substring(rownames(mut_matrix),1,12)%in%BRCA2,"BRCA2"]<-TRUE
saveRDS(mut_matrix,"data/mutation_matrix.rds")
saveRDS(pathway_genes,"data/pathway_genes.rds")
saveRDS(enriched_pathways,"data/enriched_pathways.rds")
saveRDS(CGImuts[,c("sample","gene","consequence","gene_role")],"data/CGI_mut_role.rds")
saveRDS(CGIcna[,c("sample","gene","cna","gene_role")],"data/CGI_cna_role.rds")
mut_matrix<-readRDS("data/mutation_matrix.rds")
pathway_genes<-readRDS("data/pathway_genes.rds")
enriched_pathways<-readRDS("data/enriched_pathways.rds")
min_cases_mutated<-24
min_cases_top_two_genes<-5
pathway_mut_matrix<-c()
remove_pathway<-c()
matrix_pathways<-c()
pathway_mutated_genecount<-list()
all_exposures_mut_wt<-c()
for(p in names(pathway_genes))
{
curr_genes<-pathway_genes[[p]]
curr_mut_matrix<-as.matrix(mut_matrix[,colnames(mut_matrix)%in%curr_genes])
if(sum(rowSums(curr_mut_matrix)>0)>=min_cases_mutated&sum(colSums(curr_mut_matrix)>=min_cases_top_two_genes)>=0)
{
if(ncol(curr_mut_matrix)>1)
{
pathway_mut_matrix<-cbind(pathway_mut_matrix,rowSums(curr_mut_matrix)>0)
matrix_pathways<-c(matrix_pathways,p)
pathway_mutated_genecount[[p]]<-colSums(curr_mut_matrix)
}else if(sum(colnames(mut_matrix)%in%curr_genes)==0){
remove_pathway<-c(remove_pathway,p)
}else{
pathway_mut_matrix<-cbind(pathway_mut_matrix,curr_mut_matrix[,1])
matrix_pathways<-c(matrix_pathways,p)
pathway_mutated_genecount[[p]]<-sum(curr_mut_matrix)
}
}
}
colnames(pathway_mut_matrix)<-matrix_pathways
mut_ratio_info<-c()
for(g in colnames(pathway_mut_matrix))
{
mutated<-pathway_mut_matrix[,g]
for(s in rownames(all_sig))
{
mut<-as.numeric(all_sig[s,rownames(pathway_mut_matrix)[mutated]])
nonmut<-as.numeric(all_sig[s,rownames(pathway_mut_matrix)[!mutated]])
all_exposures_mut_wt<-rbind(all_exposures_mut_wt,rbind(cbind(s,g,"mut",mut),cbind(s,g,"wt",nonmut)))
if(median(mut)>median(nonmut))
{
mut_ratio_info<-rbind(mut_ratio_info,c(g,s,median(mut)-median(nonmut),wilcox.test(mut,nonmut,alternative="greater")$p.value))
}
}
}
colnames(mut_ratio_info)<-c("Type","Signature","mean_diff","pvalue")
mut_ratio_info<-data.frame(mut_ratio_info,stringsAsFactors=F)
mut_ratio_info<-mut_ratio_info[abs(as.numeric(mut_ratio_info$mean_diff))>=0.10,]
mut_ratio_info$qvalue<-p.adjust(mut_ratio_info[,4],method="BH")
mut_ratio_info$Description<-enriched_pathways[mut_ratio_info$Type,"Description"]
mut_ratio_info$Signature<-factor(mut_ratio_info$Signature,levels=paste0("s",1:nsig))
mut_ratio_info$Signature<-plyr::revalue(mut_ratio_info$Signature,
c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
mut_ratio_info<-mut_ratio_info[mut_ratio_info$qvalue<0.005,]
mut_ratio_info<-merge(mut_ratio_info,enriched_pathways[,c("ID","geneID")],by.x=1,by.y=1)
mut_ratio_info<-mut_ratio_info[order(mut_ratio_info$Signature,mut_ratio_info$qvalue),]
colnames(all_exposures_mut_wt)<-c("Signature","ID","Status","Exposure")
all_exposures_mut_wt<-data.frame(all_exposures_mut_wt,stringsAsFactors = F)
all_exposures_mut_wt$Description<-enriched_pathways[all_exposures_mut_wt$ID,"Description"]
knitr::kable(mut_ratio_info[,1:6],"html") %>%
kableExtra::kable_styling(full_width=F) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| Type | Signature | mean_diff | pvalue | qvalue | Description | |
|---|---|---|---|---|---|---|
| 172 | R-HSA-8849471 | 1 | 0.252588740179993 | 8.46584283845526e-07 | 0.0000010 | PTK6 Regulates RHO GTPases, RAS GTPase and MAP kinases |
| 141 | R-HSA-5658442 | 1 | 0.246731682237831 | 4.58409095238899e-06 | 0.0000051 | Regulation of RAS by GAPs |
| 144 | R-HSA-5685942 | 3 | 0.198180242225367 | 0.00159530945990504 | 0.0017180 | HDR through Homologous Recombination (HRR) |
| 148 | R-HSA-5693579 | 3 | 0.198180242225367 | 0.00159530945990504 | 0.0017180 | Homologous DNA Pairing and Strand Exchange |
| 145 | R-HSA-5693537 | 3 | 0.196550061541629 | 0.00259400741114232 | 0.0027482 | Resolution of D-Loop Structures |
| 146 | R-HSA-5693554 | 3 | 0.196550061541629 | 0.00259400741114232 | 0.0027482 | Resolution of D-loop Structures through Synthesis-Dependent Strand Annealing (SDSA) |
| 147 | R-HSA-5693568 | 3 | 0.196550061541629 | 0.00259400741114232 | 0.0027482 | Resolution of D-loop Structures through Holliday Junction Intermediates |
| 149 | R-HSA-5693616 | 3 | 0.187943638170493 | 0.00319193917662721 | 0.0033635 | Presynaptic phase of homologous DNA pairing and strand exchange |
| 150 | R-HSA-6785807 | 4 | 0.109545051623606 | 4.00495361286718e-25 | 0.0000000 | Interleukin-4 and 13 signaling |
| 62 | R-HSA-2219528 | 4 | 0.109095994308619 | 3.04174513143323e-23 | 0.0000000 | PI3K/AKT Signaling in Cancer |
| 84 | R-HSA-3769402 | 4 | 0.115938226332199 | 1.3729603212145e-21 | 0.0000000 | Deactivation of the beta-catenin transactivating complex |
| 76 | R-HSA-2730905 | 4 | 0.104936071168879 | 4.19804789394049e-21 | 0.0000000 | Role of LAT2/NTAL/LAB on calcium mobilization |
| 10 | R-HSA-1257604 | 4 | 0.104317921864257 | 2.82432941959549e-20 | 0.0000000 | PIP3 activates AKT signaling |
| 41 | R-HSA-180292 | 4 | 0.104317921864257 | 2.82432941959549e-20 | 0.0000000 | GAB1 signalosome |
| 87 | R-HSA-388841 | 4 | 0.103285240682332 | 1.19224217769441e-19 | 0.0000000 | Costimulation by the CD28 family |
| 56 | R-HSA-198203 | 4 | 0.103395592455067 | 1.24618903058137e-19 | 0.0000000 | PI3K/AKT activation |
| 104 | R-HSA-452723 | 4 | 0.118327541097481 | 2.83909515508328e-18 | 0.0000000 | Transcriptional regulation of pluripotent stem cells |
| 107 | R-HSA-5218920 | 4 | 0.102817780424228 | 1.62863549453764e-17 | 0.0000000 | VEGFR2 mediated vascular permeability |
| 58 | R-HSA-199418 | 4 | 0.100050735691377 | 3.19664360147671e-17 | 0.0000000 | Negative regulation of the PI3K/AKT network |
| 3 | R-HSA-109704 | 4 | 0.100389973795218 | 1.22450079485844e-16 | 0.0000000 | PI3K Cascade |
| 158 | R-HSA-6811558 | 4 | 0.108667130091737 | 2.4115328708105e-16 | 0.0000000 | PI5P, PP2A and IER3 Regulate PI3K/AKT Signaling |
| 64 | R-HSA-2219530 | 4 | 0.104814160061847 | 1.17910747115717e-15 | 0.0000000 | Constitutive Signaling by Aberrant PI3K in Cancer |
| 176 | R-HSA-8864260 | 4 | 0.100850241511579 | 1.5469307229411e-15 | 0.0000000 | Transcriptional regulation by the AP-2 (TFAP2) family of transcription factors |
| 80 | R-HSA-373760 | 4 | 0.115456499609437 | 1.99543690448472e-15 | 0.0000000 | L1CAM interactions |
| 67 | R-HSA-2559580 | 4 | 0.109615819583484 | 4.24875054152071e-11 | 0.0000000 | Oxidative Stress Induced Senescence |
| 129 | R-HSA-5654727 | 4 | 0.109138003141597 | 6.42384871144081e-11 | 0.0000000 | Negative regulation of FGFR2 signaling |
| 132 | R-HSA-5654733 | 4 | 0.100210636440278 | 6.65771113813847e-09 | 0.0000000 | Negative regulation of FGFR4 signaling |
| 69 | R-HSA-2559582 | 4 | 0.1209934957094 | 1.60910024609967e-08 | 0.0000000 | Senescence-Associated Secretory Phenotype (SASP) |
| 72 | R-HSA-2559585 | 4 | 0.10549338574056 | 4.65118251830628e-08 | 0.0000001 | Oncogene Induced Senescence |
| 102 | R-HSA-450294 | 4 | 0.14910156336646 | 1.06327477150069e-07 | 0.0000001 | MAP kinase activation in TLR cascade |
| 18 | R-HSA-166166 | 4 | 0.147241750288408 | 1.69755501756731e-07 | 0.0000002 | MyD88-independent TLR3/TLR4 cascade |
| 25 | R-HSA-168164 | 4 | 0.147241750288408 | 1.69755501756731e-07 | 0.0000002 | Toll Like Receptor 3 (TLR3) Cascade |
| 177 | R-HSA-937061 | 4 | 0.147241750288408 | 1.69755501756731e-07 | 0.0000002 | TRIF-mediated TLR3/TLR4 signaling |
| 14 | R-HSA-166054 | 4 | 0.136149750110542 | 1.84027943483452e-07 | 0.0000002 | Activated TLR4 signalling |
| 21 | R-HSA-168138 | 4 | 0.136149750110542 | 1.84027943483452e-07 | 0.0000002 | Toll Like Receptor 9 (TLR9) Cascade |
| 31 | R-HSA-168181 | 4 | 0.136149750110542 | 1.84027943483452e-07 | 0.0000002 | Toll Like Receptor 7/8 (TLR7/8) Cascade |
| 181 | R-HSA-975155 | 4 | 0.136149750110542 | 1.84027943483452e-07 | 0.0000002 | MyD88 dependent cascade initiated on endosome |
| 16 | R-HSA-166058 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | MyD88:Mal cascade initiated on plasma membrane |
| 23 | R-HSA-168142 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | Toll Like Receptor 10 (TLR10) Cascade |
| 27 | R-HSA-168176 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | Toll Like Receptor 5 (TLR5) Cascade |
| 29 | R-HSA-168179 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | Toll Like Receptor TLR1:TLR2 Cascade |
| 33 | R-HSA-168188 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | Toll Like Receptor TLR6:TLR2 Cascade |
| 44 | R-HSA-181438 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | Toll Like Receptor 2 (TLR2) Cascade |
| 179 | R-HSA-975138 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | TRAF6 mediated induction of NFkB and MAP kinases upon TLR7/8 or 9 activation |
| 183 | R-HSA-975871 | 4 | 0.141579622011902 | 2.08023669764227e-07 | 0.0000002 | MyD88 cascade initiated on plasma membrane |
| 12 | R-HSA-166016 | 4 | 0.135476447152699 | 3.84382580101352e-07 | 0.0000005 | Toll Like Receptor 4 (TLR4) Cascade |
| 35 | R-HSA-168898 | 4 | 0.135476447152699 | 3.84382580101352e-07 | 0.0000005 | Toll-Like Receptors Cascades |
| 100 | R-HSA-400685 | 4 | 0.103852168540807 | 1.51313003420957e-06 | 0.0000017 | Sema4D in semaphorin signaling |
| 97 | R-HSA-3928665 | 4 | 0.115442790036247 | 4.68758733659412e-06 | 0.0000052 | EPH-ephrin mediated repulsion of cells |
| 54 | R-HSA-1963642 | 4 | 0.134373005439062 | 5.71484609954307e-06 | 0.0000063 | PI3K events in ERBB2 signaling |
| 166 | R-HSA-69563 | 6 | 0.163574231031668 | 6.16914049604083e-36 | 0.0000000 | p53-Dependent G1 DNA Damage Response |
| 167 | R-HSA-69580 | 6 | 0.163574231031668 | 6.16914049604083e-36 | 0.0000000 | p53-Dependent G1/S DNA damage checkpoint |
| 168 | R-HSA-69615 | 6 | 0.163574231031668 | 6.16914049604083e-36 | 0.0000000 | G1/S DNA Damage Checkpoints |
| 71 | R-HSA-2559583 | 6 | 0.118512570615044 | 1.79665216695224e-32 | 0.0000000 | Cellular Senescence |
| 74 | R-HSA-2559586 | 6 | 0.166339581526184 | 3.69015347631317e-32 | 0.0000000 | DNA Damage/Telomere Stress Induced Senescence |
| 152 | R-HSA-6791312 | 6 | 0.160307590148722 | 5.91855928379218e-32 | 0.0000000 | TP53 Regulates Transcription of Cell Cycle Genes |
| 63 | R-HSA-2219528 | 6 | 0.150273201542753 | 1.96024379213766e-31 | 0.0000000 | PI3K/AKT Signaling in Cancer |
| 161 | R-HSA-69206 | 6 | 0.104638891204408 | 1.93374648134732e-31 | 0.0000000 | G1/S Transition |
| 160 | R-HSA-69202 | 6 | 0.1044133774531 | 5.09471406738826e-31 | 0.0000000 | Cyclin E associated events during G1/S transition |
| 108 | R-HSA-5218920 | 6 | 0.188383693020523 | 3.65783580166394e-29 | 0.0000000 | VEGFR2 mediated vascular permeability |
| 66 | R-HSA-2262752 | 6 | 0.109931542668385 | 5.155560602407e-29 | 0.0000000 | Cellular responses to stress |
| 170 | R-HSA-69656 | 6 | 0.104638891204408 | 7.85386487154019e-28 | 0.0000000 | Cyclin A:Cdk2-associated events at S phase entry |
| 164 | R-HSA-69242 | 6 | 0.104638891204408 | 1.10995848521904e-27 | 0.0000000 | S Phase |
| 171 | R-HSA-8848021 | 6 | 0.131396735988981 | 1.74865525581942e-27 | 0.0000000 | Signaling by PTK6 |
| 154 | R-HSA-6804757 | 6 | 0.152256513681293 | 2.03704657428087e-27 | 0.0000000 | Regulation of TP53 Degradation |
| 157 | R-HSA-6806003 | 6 | 0.152256513681293 | 2.03704657428087e-27 | 0.0000000 | Regulation of TP53 Expression and Degradation |
| 11 | R-HSA-1257604 | 6 | 0.131329721531884 | 5.78919514625642e-27 | 0.0000000 | PIP3 activates AKT signaling |
| 42 | R-HSA-180292 | 6 | 0.131329721531884 | 5.78919514625642e-27 | 0.0000000 | GAB1 signalosome |
| 77 | R-HSA-2730905 | 6 | 0.126422234964336 | 1.22131426717467e-26 | 0.0000000 | Role of LAT2/NTAL/LAB on calcium mobilization |
| 57 | R-HSA-198203 | 6 | 0.125586371099449 | 1.52398893332487e-26 | 0.0000000 | PI3K/AKT activation |
| 4 | R-HSA-109704 | 6 | 0.176736865803021 | 2.67330922565291e-26 | 0.0000000 | PI3K Cascade |
| 89 | R-HSA-389356 | 6 | 0.175049474706952 | 1.22657370453753e-25 | 0.0000000 | CD28 co-stimulation |
| 88 | R-HSA-388841 | 6 | 0.152256513681293 | 1.39034856023611e-24 | 0.0000000 | Costimulation by the CD28 family |
| 59 | R-HSA-199418 | 6 | 0.131329721531884 | 4.4481937945955e-24 | 0.0000000 | Negative regulation of the PI3K/AKT network |
| 91 | R-HSA-390466 | 6 | 0.161176060591935 | 2.38254362776208e-23 | 0.0000000 | Chaperonin-mediated protein folding |
| 93 | R-HSA-391251 | 6 | 0.161176060591935 | 2.38254362776208e-23 | 0.0000000 | Protein folding |
| 92 | R-HSA-390471 | 6 | 0.162420089177491 | 4.53463481111177e-23 | 0.0000000 | Association of TriC/CCT with target proteins during biosynthesis |
| 162 | R-HSA-69231 | 6 | 0.133655680740526 | 1.07920078678361e-22 | 0.0000000 | Cyclin D associated events in G1 |
| 163 | R-HSA-69236 | 6 | 0.133655680740526 | 1.07920078678361e-22 | 0.0000000 | G1 Phase |
| 175 | R-HSA-8863795 | 6 | 0.196478357886946 | 1.08088925175971e-22 | 0.0000000 | Downregulation of ERBB2 signaling |
| 6 | R-HSA-1168372 | 6 | 0.104789193119891 | 3.65454879046344e-22 | 0.0000000 | Downstream signaling events of B Cell Receptor (BCR) |
| 159 | R-HSA-6811558 | 6 | 0.164728372885845 | 7.07868823477374e-22 | 0.0000000 | PI5P, PP2A and IER3 Regulate PI3K/AKT Signaling |
| 128 | R-HSA-5654726 | 6 | 0.2086584780258 | 7.3486168950305e-22 | 0.0000000 | Negative regulation of FGFR1 signaling |
| 169 | R-HSA-69620 | 6 | 0.104789193119891 | 1.45387755853743e-21 | 0.0000000 | Cell Cycle Checkpoints |
| 186 | R-HSA-983705 | 6 | 0.1044133774531 | 1.84065076725682e-21 | 0.0000000 | Signaling by the B Cell Receptor (BCR) |
| 65 | R-HSA-2219530 | 6 | 0.175049474706952 | 1.9785732331926e-21 | 0.0000000 | Constitutive Signaling by Aberrant PI3K in Cancer |
| 90 | R-HSA-389357 | 6 | 0.178271626601345 | 2.59500038252183e-21 | 0.0000000 | CD28 dependent PI3K/Akt signaling |
| 81 | R-HSA-373760 | 6 | 0.196478357886946 | 1.3678444649737e-20 | 0.0000000 | L1CAM interactions |
| 85 | R-HSA-3769402 | 6 | 0.110052628743431 | 5.41203631730405e-20 | 0.0000000 | Deactivation of the beta-catenin transactivating complex |
| 130 | R-HSA-5654727 | 6 | 0.203673171891168 | 5.74292148324212e-19 | 0.0000000 | Negative regulation of FGFR2 signaling |
| 114 | R-HSA-5654689 | 6 | 0.203382889124788 | 1.28241269803462e-18 | 0.0000000 | PI-3K cascade:FGFR1 |
| 2 | R-HSA-109606 | 6 | 0.194838125600575 | 4.69939941267943e-17 | 0.0000000 | Intrinsic Pathway for Apoptosis |
| 68 | R-HSA-2559580 | 6 | 0.138069351667405 | 1.31157318262041e-16 | 0.0000000 | Oxidative Stress Induced Senescence |
| 86 | R-HSA-381340 | 6 | 0.203403580004792 | 2.45547083713807e-16 | 0.0000000 | Transcriptional regulation of white adipocyte differentiation |
| 73 | R-HSA-2559585 | 6 | 0.155125189539284 | 6.91045715748456e-16 | 0.0000000 | Oncogene Induced Senescence |
| 116 | R-HSA-5654695 | 6 | 0.184196567360406 | 6.86477052917807e-16 | 0.0000000 | PI-3K cascade:FGFR2 |
| 115 | R-HSA-5654693 | 6 | 0.189914838091013 | 9.19648058433477e-16 | 0.0000000 | FRS-mediated FGFR1 signaling |
| 37 | R-HSA-169893 | 6 | 0.103449044970857 | 1.74590323859923e-15 | 0.0000000 | Prolonged ERK activation events |
| 38 | R-HSA-170968 | 6 | 0.103449044970857 | 1.74590323859923e-15 | 0.0000000 | Frs2-mediated activation |
| 47 | R-HSA-187687 | 6 | 0.103449044970857 | 1.74590323859923e-15 | 0.0000000 | Signalling to ERKs |
| 131 | R-HSA-5654732 | 6 | 0.209784534529589 | 2.0339572413955e-15 | 0.0000000 | Negative regulation of FGFR3 signaling |
| 133 | R-HSA-5654733 | 6 | 0.205949436803472 | 2.17759628201591e-15 | 0.0000000 | Negative regulation of FGFR4 signaling |
| 5 | R-HSA-112412 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | SOS-mediated signalling |
| 20 | R-HSA-167044 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | Signalling to RAS |
| 39 | R-HSA-170984 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | ARMS-mediated activation |
| 40 | R-HSA-179812 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | GRB2 events in EGFR signaling |
| 43 | R-HSA-180336 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | SHC1 events in EGFR signaling |
| 48 | R-HSA-187706 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | Signalling to p38 via RIT and RIN |
| 82 | R-HSA-375165 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | NCAM signaling for neurite out-growth |
| 142 | R-HSA-5673001 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | RAF/MAP kinase cascade |
| 143 | R-HSA-5684996 | 6 | 0.103086223428299 | 2.91192468576799e-15 | 0.0000000 | MAPK1/MAPK3 signaling |
| 49 | R-HSA-190236 | 6 | 0.164505116679766 | 3.44407375719912e-15 | 0.0000000 | Signaling by FGFR |
| 7 | R-HSA-1226099 | 6 | 0.15936742005519 | 1.84051795469108e-14 | 0.0000000 | Signaling by FGFR in disease |
| 8 | R-HSA-1227986 | 6 | 0.114519753596792 | 4.48987378349295e-14 | 0.0000000 | Signaling by ERBB2 |
| 46 | R-HSA-1839124 | 6 | 0.195465032607908 | 4.73210303501712e-14 | 0.0000000 | FGFR1 mutant receptor activation |
| 165 | R-HSA-69541 | 6 | 0.178069995058495 | 1.00844409703753e-13 | 0.0000000 | Stabilization of p53 |
| 1 | R-HSA-109581 | 6 | 0.152179607398936 | 1.04587017896073e-13 | 0.0000000 | Apoptosis |
| 109 | R-HSA-5357801 | 6 | 0.152179607398936 | 1.04587017896073e-13 | 0.0000000 | Programmed Cell Death |
| 112 | R-HSA-5654687 | 6 | 0.17619060686178 | 2.95059280580149e-13 | 0.0000000 | Downstream signaling of activated FGFR1 |
| 119 | R-HSA-5654700 | 6 | 0.181662107096329 | 3.10696235534376e-13 | 0.0000000 | FRS-mediated FGFR2 signaling |
| 134 | R-HSA-5654736 | 6 | 0.167177065945052 | 3.65835612790317e-13 | 0.0000000 | Signaling by FGFR1 |
| 156 | R-HSA-6804760 | 6 | 0.17619060686178 | 6.42480315821795e-13 | 0.0000000 | Regulation of TP53 Activity through Methylation |
| 153 | R-HSA-6804114 | 6 | 0.178814329233032 | 6.94727708538222e-13 | 0.0000000 | TP53 Regulates Transcription of Genes Involved in G2 Cell Cycle Arrest |
| 50 | R-HSA-1912408 | 6 | 0.183173874985823 | 9.29576388871165e-13 | 0.0000000 | Pre-NOTCH Transcription and Translation |
| 155 | R-HSA-6804758 | 6 | 0.171804221242074 | 1.03447448986687e-12 | 0.0000000 | Regulation of TP53 Activity through Acetylation |
| 113 | R-HSA-5654688 | 6 | 0.196542214235157 | 1.09592538810522e-12 | 0.0000000 | SHC-mediated cascade:FGFR1 |
| 127 | R-HSA-5654720 | 6 | 0.184522477916402 | 1.71266396916768e-12 | 0.0000000 | PI-3K cascade:FGFR4 |
| 139 | R-HSA-5655302 | 6 | 0.164393371544755 | 1.85418664449361e-12 | 0.0000000 | Signaling by FGFR1 in disease |
| 123 | R-HSA-5654710 | 6 | 0.193048637478027 | 1.90557279403111e-12 | 0.0000000 | PI-3K cascade:FGFR3 |
| 70 | R-HSA-2559582 | 6 | 0.136435102280291 | 1.9278171960276e-12 | 0.0000000 | Senescence-Associated Secretory Phenotype (SASP) |
| 51 | R-HSA-1912422 | 6 | 0.181816483781491 | 3.75757630854625e-12 | 0.0000000 | Pre-NOTCH Expression and Processing |
| 94 | R-HSA-392451 | 6 | 0.102391778690985 | 6.31942545786839e-12 | 0.0000000 | G beta:gamma signalling through PI3Kgamma |
| 99 | R-HSA-397795 | 6 | 0.102391778690985 | 6.31942545786839e-12 | 0.0000000 | G-protein beta:gamma signalling |
| 79 | R-HSA-373755 | 6 | 0.163316717089663 | 1.73187541764886e-11 | 0.0000000 | Semaphorin interactions |
| 117 | R-HSA-5654696 | 6 | 0.149982443799756 | 5.9825590348103e-11 | 0.0000000 | Downstream signaling of activated FGFR2 |
| 138 | R-HSA-5655253 | 6 | 0.149982443799756 | 5.9825590348103e-11 | 0.0000000 | Signaling by FGFR2 in disease |
| 135 | R-HSA-5654738 | 6 | 0.141114922885457 | 6.86563006005631e-11 | 0.0000000 | Signaling by FGFR2 |
| 83 | R-HSA-376176 | 6 | 0.183623594979209 | 2.14287569338875e-10 | 0.0000000 | Signaling by Robo receptor |
| 124 | R-HSA-5654712 | 6 | 0.168759366581101 | 4.46942232901848e-10 | 0.0000000 | FRS-mediated FGFR4 signaling |
| 19 | R-HSA-166166 | 6 | 0.13505549202586 | 5.00342335673692e-10 | 0.0000000 | MyD88-independent TLR3/TLR4 cascade |
| 26 | R-HSA-168164 | 6 | 0.13505549202586 | 5.00342335673692e-10 | 0.0000000 | Toll Like Receptor 3 (TLR3) Cascade |
| 178 | R-HSA-937061 | 6 | 0.13505549202586 | 5.00342335673692e-10 | 0.0000000 | TRIF-mediated TLR3/TLR4 signaling |
| 121 | R-HSA-5654706 | 6 | 0.17523692301412 | 5.82506066227636e-10 | 0.0000000 | FRS-mediated FGFR3 signaling |
| 15 | R-HSA-166054 | 6 | 0.13505549202586 | 7.40306092069974e-10 | 0.0000000 | Activated TLR4 signalling |
| 22 | R-HSA-168138 | 6 | 0.13505549202586 | 7.40306092069974e-10 | 0.0000000 | Toll Like Receptor 9 (TLR9) Cascade |
| 32 | R-HSA-168181 | 6 | 0.13505549202586 | 7.40306092069974e-10 | 0.0000000 | Toll Like Receptor 7/8 (TLR7/8) Cascade |
| 182 | R-HSA-975155 | 6 | 0.13505549202586 | 7.40306092069974e-10 | 0.0000000 | MyD88 dependent cascade initiated on endosome |
| 98 | R-HSA-3928665 | 6 | 0.192854526292227 | 8.22835218472964e-10 | 0.0000000 | EPH-ephrin mediated repulsion of cells |
| 17 | R-HSA-166058 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | MyD88:Mal cascade initiated on plasma membrane |
| 24 | R-HSA-168142 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | Toll Like Receptor 10 (TLR10) Cascade |
| 28 | R-HSA-168176 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | Toll Like Receptor 5 (TLR5) Cascade |
| 30 | R-HSA-168179 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | Toll Like Receptor TLR1:TLR2 Cascade |
| 34 | R-HSA-168188 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | Toll Like Receptor TLR6:TLR2 Cascade |
| 45 | R-HSA-181438 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | Toll Like Receptor 2 (TLR2) Cascade |
| 180 | R-HSA-975138 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | TRAF6 mediated induction of NFkB and MAP kinases upon TLR7/8 or 9 activation |
| 184 | R-HSA-975871 | 6 | 0.143248979847831 | 1.05188570113004e-09 | 0.0000000 | MyD88 cascade initiated on plasma membrane |
| 103 | R-HSA-450294 | 6 | 0.162973566722587 | 1.22288862099984e-09 | 0.0000000 | MAP kinase activation in TLR cascade |
| 75 | R-HSA-2682334 | 6 | 0.106234165176346 | 1.7091420198163e-09 | 0.0000000 | EPH-Ephrin signaling |
| 61 | R-HSA-2029482 | 6 | 0.168197627999717 | 6.90855391597471e-09 | 0.0000000 | Regulation of actin dynamics for phagocytic cup formation |
| 13 | R-HSA-166016 | 6 | 0.117870810743105 | 9.55211957596783e-09 | 0.0000000 | Toll Like Receptor 4 (TLR4) Cascade |
| 36 | R-HSA-168898 | 6 | 0.117870810743105 | 9.55211957596783e-09 | 0.0000000 | Toll-Like Receptors Cascades |
| 78 | R-HSA-3214858 | 6 | 0.139322757701264 | 9.45086483596051e-09 | 0.0000000 | RMTs methylate histone arginines |
| 55 | R-HSA-1963642 | 6 | 0.204442546887885 | 1.54713345516581e-08 | 0.0000000 | PI3K events in ERBB2 signaling |
| 118 | R-HSA-5654699 | 6 | 0.180040793554223 | 2.05058839489934e-08 | 0.0000000 | SHC-mediated cascade:FGFR2 |
| 185 | R-HSA-983231 | 6 | 0.101187314067097 | 3.78204621566802e-08 | 0.0000001 | Factors involved in megakaryocyte development and platelet production |
| 125 | R-HSA-5654716 | 6 | 0.112753083450221 | 4.46002952435291e-08 | 0.0000001 | Downstream signaling of activated FGFR4 |
| 137 | R-HSA-5654743 | 6 | 0.111928700819091 | 4.68240717581215e-08 | 0.0000001 | Signaling by FGFR4 |
| 122 | R-HSA-5654708 | 6 | 0.126382655875272 | 6.38398344004582e-08 | 0.0000001 | Downstream signaling of activated FGFR3 |
| 140 | R-HSA-5655332 | 6 | 0.126382655875272 | 6.38398344004582e-08 | 0.0000001 | Signaling by FGFR3 in disease |
| 174 | R-HSA-8853338 | 6 | 0.126382655875272 | 6.38398344004582e-08 | 0.0000001 | Signaling by FGFR3 point mutants in cancer |
| 136 | R-HSA-5654741 | 6 | 0.112753083450221 | 6.83794452415392e-08 | 0.0000001 | Signaling by FGFR3 |
| 95 | R-HSA-3928662 | 6 | 0.150416046395665 | 1.73754967513525e-07 | 0.0000002 | EPHB-mediated forward signaling |
| 96 | R-HSA-3928663 | 6 | 0.110898529367959 | 1.87549038498414e-07 | 0.0000002 | EPHA-mediated growth cone collapse |
| 9 | R-HSA-1250196 | 6 | 0.172423031278246 | 9.6053556097146e-07 | 0.0000011 | SHC1 events in ERBB2 signaling |
| 53 | R-HSA-1963640 | 6 | 0.172423031278246 | 9.6053556097146e-07 | 0.0000011 | GRB2 events in ERBB2 signaling |
| 110 | R-HSA-5617472 | 6 | 0.13365088429017 | 9.8655906620644e-07 | 0.0000011 | Activation of anterior HOX genes in hindbrain development during early embryogenesis |
| 111 | R-HSA-5619507 | 6 | 0.13365088429017 | 9.8655906620644e-07 | 0.0000011 | Activation of HOX genes during differentiation |
| 101 | R-HSA-400685 | 6 | 0.128774113154716 | 1.43897006120777e-06 | 0.0000016 | Sema4D in semaphorin signaling |
| 173 | R-HSA-8851805 | 6 | 0.117444747747302 | 2.68336319428933e-05 | 0.0000295 | MET activates RAS signaling |
| 126 | R-HSA-5654719 | 6 | 0.137323920383 | 3.68249172347517e-05 | 0.0000403 | SHC-mediated cascade:FGFR4 |
| 120 | R-HSA-5654704 | 6 | 0.147967798750578 | 4.89863113928989e-05 | 0.0000533 | SHC-mediated cascade:FGFR3 |
| 106 | R-HSA-453279 | 7 | 0.110889810549242 | 7.50929025234752e-20 | 0.0000000 | Mitotic G1-G1/S phases |
| 151 | R-HSA-6785807 | 7 | 0.105402843779737 | 1.93583573871198e-18 | 0.0000000 | Interleukin-4 and 13 signaling |
| 52 | R-HSA-195721 | 7 | 0.110146357993067 | 4.26885062439873e-17 | 0.0000000 | Signaling by Wnt |
| 60 | R-HSA-201681 | 7 | 0.109536252555609 | 8.12155889253289e-17 | 0.0000000 | TCF dependent signaling in response to WNT |
| 105 | R-HSA-452723 | 7 | 0.103223636213854 | 4.5266594431579e-11 | 0.0000000 | Transcriptional regulation of pluripotent stem cells |
driver_genes<-c("NF1","RB1","BRCA1","BRCA2","CDK12","CCNE1","PTEN","MYC","PIK3CA")
d_mut_ratio_info<-c()
all_exposures_mut_wt_gene<-c()
for(g in driver_genes)
{
mutated<-mut_matrix[,g]
for(s in rownames(all_sig))
{
mut<-as.numeric(all_sig[s,rownames(mut_matrix)[mutated]])
nonmut<-as.numeric(all_sig[s,rownames(mut_matrix)[!mutated]])
all_exposures_mut_wt_gene<-rbind(all_exposures_mut_wt_gene,rbind(cbind(s,g,"mut",mut),cbind(s,g,"wt",nonmut)))
if(median(mut)>median(nonmut))
{
d_mut_ratio_info<-rbind(d_mut_ratio_info,c(g,s,median(mut)-median(nonmut),wilcox.test(mut,nonmut,alternative="greater")$p.value))
}
}
}
colnames(d_mut_ratio_info)<-c("Gene","Signature","median_diff","pvalue")
d_mut_ratio_info<-data.frame(d_mut_ratio_info,stringsAsFactors = F)
d_mut_ratio_info<-d_mut_ratio_info[d_mut_ratio_info$median_diff>=0.07,]
d_mut_ratio_info$adj.pvalue<-p.adjust(d_mut_ratio_info$pvalue)
d_mut_ratio_info<-d_mut_ratio_info[d_mut_ratio_info$adj.pvalue<=0.05,]
knitr::kable(d_mut_ratio_info) %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "300px")
| Gene | Signature | median_diff | pvalue | adj.pvalue | |
|---|---|---|---|---|---|
| 1 | NF1 | s1 | 0.262179436462879 | 0.00434334955231847 | 0.0304034 |
| 7 | BRCA1 | s3 | 0.200486753115807 | 0.00899089052048452 | 0.0449545 |
| 10 | BRCA2 | s3 | 0.242551077528908 | 0.00542057060793141 | 0.0325234 |
| 11 | CDK12 | s2 | 0.119981131001674 | 0.000227495729328335 | 0.0020475 |
| 12 | CDK12 | s4 | 0.179205973812764 | 8.68644521119764e-05 | 0.0008686 |
| 15 | CCNE1 | s4 | 0.0852403555464706 | 1.76266741440831e-11 | 0.0000000 |
| 16 | CCNE1 | s6 | 0.179899918038168 | 7.39094550209736e-27 | 0.0000000 |
| 18 | PTEN | s3 | 0.338255799041865 | 0.000248900414623073 | 0.0020475 |
| 20 | MYC | s4 | 0.100097994114185 | 7.17013743046833e-13 | 0.0000000 |
| 22 | MYC | s7 | 0.0805943170357554 | 9.70124826320499e-08 | 0.0000011 |
selected_path<-read.table("data/pathway_results_filtered.csv",header = T,sep="\t",stringsAsFactors = F)
selected_pathways<-mut_ratio_info[mut_ratio_info$Type%in%selected_path$Type,c(1,6,2,3,4,5,7)]
CGImuts<-readRDS("data/CGI_mut_role.rds")
CGIcna<-readRDS("data/CGI_cna_role.rds")
#manual summary results
gene_roles<-c()
for(i in 1:nrow(selected_pathways))
{
mut<-CGImuts[CGImuts$gene%in%unlist(strsplit(selected_pathways[i,"geneID"],"/")),]
cna<-CGIcna[CGIcna$gene%in%unlist(strsplit(selected_pathways[i,"geneID"],"/")),]
gene_roles<-rbind(gene_roles,cbind(selected_pathways[i,c(1:2)],mut))
if(nrow(cna)>1)
{
colnames(cna)<-colnames(mut)
gene_roles<-rbind(gene_roles,cbind(selected_pathways[i,c(1:2)],cna))
}
}
gene_roles<-gene_roles[order(gene_roles[,1],gene_roles[,2],gene_roles[,4]),]
all_diff_exp<-selected_pathways
#generate pathway mutation plots
pdat<-c()
for(p in unique(selected_pathways$Type))
{
desc<-unique(selected_pathways[selected_pathways$Type==p,"Description"])
curr_genes<-pathway_genes[[p]]
for(g in curr_genes)
{
m<-CGImuts[CGImuts$gene==g,c("sample","gene","consequence")]
m<-m[!duplicated(m$sample),]
c<-CGIcna[CGIcna$gene==g,c("sample","gene","cna")]
c<-c[!duplicated(c$sample),]
colnames(c)<-c("sample","gene","consequence")
if(nrow(m)>0)
{
pdat<-rbind(pdat,cbind(desc,m))
}
if(nrow(c)>0)
{
pdat<-rbind(pdat,cbind(desc,c))
}
}
}
pdat<-pdat[!pdat$gene=="TP53",]
ord<-pdat %>% group_by(gene) %>% summarise(count=n()) %>% arrange(count)
pdat$gene<-factor(pdat$gene,levels=ord$gene)
ggplot(pdat,aes(x=gene,fill=consequence))+
geom_bar(stat="count",position = "stack")+
facet_grid(desc ~ .,scales="free",space="free")+
coord_flip()+my_theme+theme(strip.text.y = element_text(angle = 0),axis.text.y=element_text(face="italic"))+scale_fill_manual(values=cbPalette)+xlab("Driver genes")+ylab("Mutated cases")
mut_path_count<-reshape2::melt(table(pdat[,c(1,3)]))
mut_path_count<-mut_path_count[mut_path_count$value>0,]
pdat<-all_exposures_mut_wt[all_exposures_mut_wt$Description%in%selected_pathways$Description,]
pdat$highlight<-FALSE
pdat$highlight[pdat$Signature%in%c("s4","s6")&pdat$Description=="PI3K/AKT Signaling in Cancer"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s6")&pdat$Description=="Toll Like Receptor 3 (TLR3) Cascade"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s7")&pdat$Description=="Signaling by Wnt"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s1")&pdat$Description=="Regulation of RAS by GAPs"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s7")&pdat$Description=="Interleukin-4 and 13 signaling"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s6")&pdat$Description=="Cyclin E associated events during G1/S transition "]<-TRUE
pdat$highlight[pdat$Signature%in%c("s6")&pdat$Description=="Cellular Senescence"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Description=="HDR through Homologous Recombination (HRR)"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s6")&pdat$Description=="Cyclin D associated events in G1"]<-TRUE
pdat$Description<-plyr::revalue(pdat$Description,c(
"PI3K/AKT Signaling in Cancer"="PI3K/AKT\nsignaling",
"Cyclin E associated events during G1/S transition "="Cyclin E\nevents\nin G1/S",
"Interleukin-4 and 13 signaling"="Interleukin-4\nand 13\nsignaling",
"Regulation of RAS by GAPs"="RAS\nsignaling",
"Signaling by Wnt"="Wnt\nsignaling",
"Cyclin D associated events in G1"="Cyclin D\nevents in G1",
"Toll Like Receptor 3 (TLR3) Cascade"="Toll-Like\nReceptors\nCascades",
"HDR through Homologous Recombination (HRR)"="Homologous\nrecombination"))
pdat$Signature<-plyr::revalue(pdat$Signature,
c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat,aes(x=Status,y=as.numeric(Exposure)))+
geom_rect(data = pdat,aes(fill = highlight==TRUE),xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf,alpha = 0.1)+
geom_boxplot(outlier.size = 0.5,outlier.alpha = 0.3,varwidth = T,lwd=0.25)+facet_grid(Signature~Description)+
theme_bw()+theme(axis.text=element_text(size=6),axis.title=element_text(size=6),legend.position = "none",
strip.text = element_text(size = 5))+ylab("Exposure")+coord_cartesian(ylim=c(0,0.7))+
scale_fill_manual(values=c("white","grey"))
p
pdat<-data.frame(all_exposures_mut_wt_gene,stringsAsFactors = F)
colnames(pdat)<-c("Signature","Gene","Status","Exposure")
pdat$highlight<-FALSE
pdat$highlight[pdat$Signature%in%c("s1")&pdat$Gene=="NF1"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Gene=="BRCA1"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Gene=="BRCA2"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s2","s4")&pdat$Gene=="CDK12"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s6")&pdat$Gene=="CCNE1"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s7")&pdat$Gene=="MYC"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Gene=="PTEN"]<-TRUE
p<-ggplot(pdat,aes(x=Status,y=as.numeric(Exposure)))+
geom_rect(data = pdat,aes(fill = highlight==TRUE),xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf,alpha = 0.1)+
geom_boxplot(outlier.size = 0.5,outlier.alpha = 0.3,varwidth = T,lwd=0.25)+facet_grid(Signature~Gene)+
theme_bw()+theme(axis.text=element_text(size=6),axis.title=element_text(size=6),legend.position = "none",
strip.text = element_text(size = 5),strip.text.x=element_text(face="italic"))+ylab("Exposure")+#coord_cartesian(ylim=c(0,0.7))+
scale_fill_manual(values=c("white","grey"))
p
This section describes how the trinuleotide motif matrices were extracted from deepWGS Britoc-1 and PCAWG samples using the SomaticSignatures R package and mutational signature exposures extracted using the deconstructSigs R package.
The VCF files can be obtained by writing to the relevant Data Access Committees and placed in data/britroc/snv/ and restricted_data/pcawg_snv/ before running the code below. The human genome reference files used in each cohort should be placed in data/reference/britroc/GRCh37.fa and restricted_data/reference/pcawg/genome.fa.
The motif matrix files derived from the SNV calls used to generate the COSMIC mutational signature exposures of both cohorts are provided.
suppressMessages(library(dplyr))
suppressMessages(library(SomaticSignatures))
suppressMessages(library(deconstructSigs))
suppressMessages(library(NMF))
# Helper functions to extract information from VCF files
# Filter variants passing all filters in the VCF file
filterPass <- function(vrobj){
if(dim(softFilterMatrix(vrobj))[2] !=0){
nfilters <- dim(softFilterMatrix(vrobj))[2]
vr <- vrobj[which(rowSums(softFilterMatrix(vrobj)) ==nfilters)]
}else{
vr <- vrobj
}
return(vr)
}
# Extract trinulceotide frequencies from VCF file
createMotifMatrix <- function(vcffile){
s <- samples(scanVcfHeader(vcffile))
s <- s[grep("tumor",s)] # select the tumor sample from the vcf which contains both the tumour and matched normal variants
vr_in <- readVcfAsVRanges(vcffile, "hg19", param = ScanVcfParam(info=NA, samples=s,geno="GT"))
vr <- filterPass(vr_in)
motifs <- mutationContext(vr,ref_fasta)
mm <- motifMatrix(motifs)
return(mm)
}
# Extract mutational signature exposures
getWeights <- function(motifmatrix,signature){
# convert to format required by deconstructSigs
mm <- data.frame(t(motifmatrix))
# use the column names in the deconstructSigs signatures dataframe
colnames(mm) <- colnames(signature)
wt_df <- data.frame()
for(s in rownames(mm)){
ws <- whichSignatures(tumor.ref = mm, signatures.ref = signature, sample.id = s)
#add unknown component
Unknown <- ws$unknown
wt_df <- rbind(wt_df,cbind(ws$weights,Unknown))
}
return(wt_df)
}
The code below is to generate the motif matrix for the Britroc cohort from the original snv calls.
#Extract trinucleotide frequencies from the BritROC-1 SNV calls
# Subset to list of samples passing QC
samplestatus <- read.csv("data/britroc_deepWGS_sample_status.csv", as.is=T)
trialno <- unique(samplestatus$britroc_ID[samplestatus$status=="final"])
# Create FaFile object from Reference
ref_fasta <- FaFile("data/reference/britroc/GRCh37.fa")
# Create motif matrix from deepWGS BriTROC-1 data
vcfpath <- "data/britroc/snv/"
pat <- "snv.filters.blacklist.vcf" # suffix of VCF files to be processed
vcflist <- list.files(vcfpath, full.names=TRUE, pattern=pat)
# select samples passing QC
pat <- gsub("[$]", "", pat)
vcf_in <- paste0(vcfpath,"/", trialno, ".", pat)
# Select only files that exist
vcf_in <- vcf_in[vcf_in %in% vcflist]
motifmat_list <- lapply(vcf_in, createMotifMatrix)
motifmat <- do.call(cbind.data.frame, motifmat_list)
write.table(motifmat, "data/britroc_deepWGS_snv_motif_matrix.txt", sep = "\t", quote=F)
cosmic_weights <- getWeights(motifmat,signatures.cosmic)
cosmic_out <- cbind.data.frame(row.names(cosmic_weights), cosmic_weights, stringsAsFactors=F)
colnames(cosmic_out)[1] <- "Sample"
cosmic_outfile <- "data/britroc_cosmic_weights.txt"
write.table(cosmic_out, file = cosmic_outfile, row.names = F, quote = F,sep="\t")
The code below is to get mutational signature exposures for the Britroc cohort from the motif matrix file generated above.
# Extract weights of COSMIC mutational signatures from the BriTROC-1 cohort
motifmat <- read.table("data/britroc_deepWGS_snv_motif_matrix.txt", sep="\t", header=T)
cosmic_weights <- getWeights(motifmat,signatures.cosmic)
rownames(cosmic_weights) <- gsub("X","",rownames(cosmic_weights))
rownames(cosmic_weights) <- gsub("[.]","-",rownames(cosmic_weights))
cosmic_out <- cbind.data.frame(row.names(cosmic_weights), cosmic_weights, stringsAsFactors=F)
colnames(cosmic_out)[1] <- "Sample"
cosmic_outfile <- "data/britroc_cosmic_weights.txt"
write.table(cosmic_out, file = cosmic_outfile, row.names = F, quote = F,sep="\t")
The code below is to generate the motif matrix for the PCAWG cohort from the original snv calls.
# Extract trinucleotide frequencies from the PCAWG (OV-AU, OV-US) SNV calls
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
# Consensus SNV data must be placed in the following directory and files must have the
# .consensus.20160830.somatic.snv_mnv.vcf.gz extension
snvdir <-"restricted_data/pcawg_snv/"
ext <- ".consensus.20160830.somatic.snv_mnv.vcf.gz"
pcawg_ov_filelist <- paste0(snvdir,pcawg_ids, ext)
pcawg_ref <- FaFile("restricted_data/reference/pcawg/genome.fa")
pcawg_vranges <- lapply(pcawg_ov_filelist, readVcfAsVRanges, param = ScanVcfParam(info=NA))
pcawg_vranges <- lapply(
seq_along(pcawg_vranges), function(v, n, i) {
sampleNames(v[[i]]) = n[i]; v[[i]]
},
v = pcawg_vranges,
n = pcawg_ids
)
pcawg_mc <- lapply(pcawg_vranges, mutationContext, pcawg_ref)
pcawg_mm <- data.frame(lapply(pcawg_mc, motifMatrix))
write.table("data/pcawg_snv_motif_matrix.txt", sep="\t", quote=F)
pcawg_cosmic_weights <- getWeights(pcawg_mm, signatures.cosmic)
pcawg_samplenames <- sub("X", "",row.names(pcawg_cosmic_weights))
pcawg_samplenames <- gsub("[.]", "-", pcawg_samplenames)
pcawg_out <- cbind.data.frame(pcawg_samplenames,pcawg_cosmic_weights)
colnames(pcawg_out)[1] <- "Sample"
pcawg_cosmic_outfile <- "data/pcawg_cosmic_weights.txt"
write.table(pcawg_out, file = pcawg_cosmic_outfile, row.names=F, quote = F, sep="\t")
The code below is to get mutational signature exposures for the PCAWG cohort from the motif matrix file (pcawg_mm) generated above.
# Extract weights of COSMIC mutational signatures
pcawg_mm <- read.table("data/pcawg_snv_motif_matrix.txt", as.is=T, header=T, sep="\t")
pcawg_cosmic_weights <- getWeights(pcawg_mm, signatures.cosmic)
pcawg_samplenames <- sub("X", "",row.names(pcawg_cosmic_weights))
pcawg_samplenames <- gsub("[.]", "-", pcawg_samplenames)
pcawg_out <- cbind.data.frame(pcawg_samplenames,pcawg_cosmic_weights, stringsAsFactors=F)
colnames(pcawg_out)[1] <- "Sample"
pcawg_cosmic_outfile <- "data/pcawg_cosmic_weights.txt"
write.table(pcawg_out, file = pcawg_cosmic_outfile, row.names=F, quote = F,sep="\t")
pcawg_cosmic_weights = read.table(file = "data/pcawg_cosmic_weights.txt",as.is = T, header=T,row.names = 1)
britroc_cosmic_weights = read.table(file = "data/britroc_cosmic_weights.txt",as.is = T, header=T,row.names = 1)
rownames(britroc_cosmic_weights)<-unlist(lapply(strsplit(rownames(britroc_cosmic_weights),"_"),"[[",3))
cosmic_weights<-rbind(pcawg_cosmic_weights,britroc_cosmic_weights)
cosmic_weights<-cosmic_weights[,colSums(cosmic_weights)>0]
SNV_sig<-cosmic_weights[rownames(cosmic_weights)%in%colnames(cbind(sig_pat_mat_britroc,sig_pat_mat_remain,sig_pat_mat_pcawg)),]
SNV_sig<-SNV_sig[,-ncol(SNV_sig)]#Remove unknown signature component
CN_sig<-cbind(sig_pat_mat_britroc,sig_pat_mat_remain,sig_pat_mat_pcawg)
CN_sig<-CN_sig[,colnames(CN_sig)%in%rownames(SNV_sig)]
CN_sig<-t(CN_sig)
CN_sig<-CN_sig[match(rownames(SNV_sig),rownames(CN_sig)),]
library(Hmisc)
SNV_CN_sig_cor<-rcorr(as.matrix(SNV_sig),as.matrix(CN_sig),type="pearson")
library(corrplot)
cormat<-SNV_CN_sig_cor$r[c(-22:-28),c(22:28)]
pval<-SNV_CN_sig_cor$P[c(-22:-28),c(22:28)]
qval<-p.adjust(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)],method="BH")
dim(qval)<-dim(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)])
rownames(qval)<-rownames(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)])
colnames(qval)<-colnames(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)])
#cormat[qval>=0.1]<-0
colnames(cormat)<-1:nsig
colnames(pval)<-1:nsig
colnames(qval)<-1:nsig
integrated_matrix<-cbind(reshape2::melt(cormat),reshape2::melt(pval)[,3],reshape2::melt(qval)[,3])
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),]
colnames(CN_sig)<-1:nsig
for(i in paste0("Signature.",c(1,3,5,13,16)))
{
for(s in 1:nsig)
{
all_cor_dat<-rbind(all_cor_dat,cbind(i,s,CN_sig[,s],SNV_sig[,i]))
all_cor_dat_cors<-rbind(all_cor_dat_cors,c(Signature=s,cor=cormat[i,s],Feature=i))
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,c(Signature=s,pval=qval[i,s],Feature=i))
}
}
rownames(all_cor_dat)<-NULL
all_cor_dat_cors<-data.frame(all_cor_dat_cors,stringsAsFactors = F)
all_cor_dat_pvals<-data.frame(all_cor_dat_pvals,stringsAsFactors = F)
knitr::kable(all_corrs) %>%
kableExtra::kable_styling(full_width = T)
| Variable | Signature | Correlation | pvalue | qvalue | |
|---|---|---|---|---|---|
| 3 | Signature.3 | 1 | -0.3126844 | 0.0000832 | 0.0024449 |
| 26 | Signature.5 | 2 | 0.2562713 | 0.0013864 | 0.0226439 |
| 43 | Signature.1 | 3 | -0.3829189 | 0.0000010 | 0.0000760 |
| 45 | Signature.3 | 3 | 0.3220786 | 0.0000491 | 0.0018038 |
| 97 | Signature.16 | 5 | -0.2564781 | 0.0013736 | 0.0226439 |
| 106 | Signature.1 | 6 | 0.4116913 | 0.0000001 | 0.0000183 |
| 108 | Signature.3 | 6 | -0.3494260 | 0.0000096 | 0.0004680 |
| 116 | Signature.13 | 6 | 0.2694577 | 0.0007566 | 0.0158890 |
| 129 | Signature.3 | 7 | 0.2903792 | 0.0002715 | 0.0066508 |
This section describes how the Tandem Duplicator Phenotype (TDP) score was calculated for PCAWG ovarian cancer samples (OV-AU, OV-US) using the method described in Menghi F, et al. (2016) The tandem duplicator phenotype as a distinct genomic configuration in cancer. Proc Natl Acad Sci USA 113(17):E2373–E2382.
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
# Helper functions to count the number of tandem duplications per chromosome, normalized by chromosome length.
chr_size <- read.table("../data/hg19.chrom.sizes.txt", sep = "\t", as.is=T)
colnames(chr_size) <- c("chr", "size")
chr_size$scale <- chr_size$size/sum(as.numeric(chr_size$size))
getTdCountPerChr <- function(filename_vec){
# Returns a dataframe of no. of duplication events per chromosome
# normalised by chromosome size, given a vector of filenames containing
# structural variant information
lapply (filename_vec, function(filename){
# dataframe to contain no. of tandem duplications per chromosome
td_chr_df <- cbind.data.frame(c(1:22,"X"), rep(0,23), stringsAsFactors=F)
colnames(td_chr_df) <- c("chr", "TD")
# read in structural variant file
sv <- read.table(filename, header=T, sep ="\t", as.is = T)
dup_counts <- sv %>%
filter(svclass == "DUP") %>%
group_by(chrom1) %>%
summarise(n_TD=n())
# update counts in the dataframe
td_chr_df$TD[match(dup_counts$chrom1,td_chr_df$chr)] <- dup_counts$n_TD
# parse sample name from filename and add to dataframe
string <- strsplit(filename,"/")[[1]][6]
sample_name <- strsplit(string1,"[.]")[[1]][1]
td_chr_df$Sample <- sample_name
# the normaliation/weight to apply to td count based on chromosome length
td_chr_df$wt <- chr_size$scale[match(td_chr_df$chr,chr_size$chr)]
td_chr_df
}
)
}
classifyTDP = function(df){
# Calculates TDP score and classifies samples into TDP or non-TDP classes.
# 0.71 (from Menghi et al.) was used as the threshold for classification
df %>%
group_by(Sample) %>%
mutate(exp = wt*sum(TD), oe=abs(TD-exp), oesum= sum(oe)) %>%
mutate(TDP.score.raw = round(-oesum/sum(TD),3)) %>%
select(Sample, TDP.score.raw) %>%
unique() %>%
mutate(TDP.score = TDP.score.raw+0.71) %>%
mutate(TDP.status = ifelse(TDP.score>0,"TDP", "NON-TDP"))
}
The structural variant files from PCAWG should be placed in restricted_data/pcawg_sv/ before running the code below.
# Calulate TDP score from structural variant calls from PCAWG (OV-AU, OV-US) samples.
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
path <- "restricted_data/pcawg_sv/"
ext <- ".pcawg_consensus_1.6.161116.somatic.sv.bedpe.gz"
pcawg_ov_filelist <- paste0(path, pcawg_ids, ext)
# calculate TDP Score and classify samples
results_ov <- getTdCountPerChr(pcawg_ov_filelist)
all_data_ov <- do.call(rbind,results_ov)
tdp_ov <- classifyTDP(all_data_ov)
write.table(tdp_ov, file = "data/pcawg_TDP_score.txt", sep="\t", quote=F, row.names = F)
TDPscores<-read.table("data/pcawg_TDP_score.txt",header=T,sep="\t",stringsAsFactors = F)
TDPscores<-TDPscores[TDPscores$Sample%in%colnames(sig_pat_mat_pcawg),]
pdat<-reshape2::melt(sig_pat_mat_pcawg)
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,TDPscores,by.x=2,by.y=1)
cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, TDP.score,method="s",exact=F)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, TDP.score,method="s",exact=F)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"TDP score"
pvals$Feature<-"TDP score"
integrated_matrix<-cbind("Tandem duplicator phenotype score",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])
pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(TDP.score)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = 8,scales="free_x")+
coord_cartesian(ylim = c(-1.2,1.2))+
geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=0.9, hjust=-0.05, vjust=0,parse = T)+
geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=0.55, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
strip.text.x = element_text(size = 7))
p
pvals$pval<-p.adjust(pvals$pval,method="BH")
all_cor_dat<-rbind(all_cor_dat,cbind("TDP score",pdat$Signature,pdat$Exposure,pdat$TDP.score))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)
This section describes how the copy-number (.spc) and translocation (.spt) input files were generated from the pcawg copy-number and structural variant datasets to be used with the Shatterproof software.
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
path <- "restricted_data/pcawg_sv/"
ext <- ".pcawg_consensus_1.6.161116.somatic.sv.bedpe.gz"
pcawg_ov_filelist <- paste0(path, pcawg_ids, ext)
pcawg_rds_idx <- which( names(pcawg_segTabs) %in% pcawg_ids)
# Column names for translocation dataframe
spt_colnames <- c("#chr1", "start", "end", "chr2", "start", "end", "quality")
# Column names for copy-number dataframe
spc_colnames <- c("#chr", "start", "end", "number", "quality")
for(sample_name in names(pcawg_segTabs)[pcawg_rds_idx]){
output_path <- paste0("shatterproof/sample_data/",sample_name,"/")
dir.create(output_path)
# read in structural variant file
sv_filename <- paste0(pcawg_fpath, sample_name, ext)
sv <- read.table(sv_filename, header=T, sep ="\t", as.is = T)
# translocations
sample_spt <- filter(sv, svclass=="TRA") %>%
select(chrom1,start1,end1, chrom2,start2,end2) %>%
mutate(quality=".", chrom1 = paste0("chr",chrom1), chrom2=paste0("chr",chrom2))
colnames(sample_spt) <- spt_colnames
write.table(sample_spt, file=paste0(output_path,sample_name,".spt"), row.names = F, quote = F, sep="\t")
# rounded absolute CN values
CN <- pcawg_segTabs[[sample_name]]
sample_spc <- mutate(CN, quality=".", segVal=round(as.numeric(segVal),0))
colnames(sample_spc) <- spc_colnames
write.table(sample_spc, file=paste0(output_path,sample_name,".spc"), row.names = F, quote = F, sep="\t")
}
Shatterproof was then run using the following perl command in the shell script runShatterproof.sh.
perl -w shatterproof.pl --cnv shatterproof/sample_data/ --trans shatterproof/sample_data/ --tp53 --config ./config.pl --output shatterproof/sample_data/output/
The final score per detected chromothripsis-like event per sample was extracted from the suspect_regions.yml files using the shell script get_final_score.sh.
# Read in table of combined Shatterproof scores across all samples
sp_scores <- read.table("data/pcawg_chromothripsis_scores.txt", as.is = T, header=T, sep="\t")
# Generate a list of events above 80th, 85th, 90th and 95th percentiles of shatterproof scores
perc_rank_scores_df <- sp_scores%>%
mutate(perc_rank = round(percent_rank(score),3)) %>%
filter(perc_rank %in% c(0.8, 0.85, 0.9, 0.95)) %>%
group_by(perc_rank) %>%
summarise(max_score = max(round(score,3)))
percentile <- perc_rank_scores_df$perc_rank*100
threshold_vec <- perc_rank_scores_df$max_score
for(i in seq_along(percentile)){
threshold <- threshold_vec[i]
perc <- percentile[i]
out <- sp_scores %>%
filter(score > threshold) %>%
group_by(sample) %>%
summarise(n=n())
out_file <- paste0("data/pcawg_chromothripsis_counts_high_scores_", perc, ".txt")
write.table(out, out_file, row.names = F, quote = F,sep="\t")
}
A conservative threshold was set at the 95th percentile of our distribution of scores to minimise false positives. Calls with scores greater than 0.485 were used to obtain a count of chromothriptic events per sample.
high_score_events_df <- sp_scores %>%
filter(score > perc_rank_scores_df$max_score[perc_rank_scores_df$perc_rank==0.95]) %>%
group_by(sample) %>%
summarise(n_high_scores=n()) %>%
group_by(n_high_scores) %>%
summarise(n_samples=n())
Of 61 samples with scores above the threshold, 49 (80.3%) had 1-2 events, 11 samples (18%) had 3-6 events and 1 sample (1.6%) had 10 events.
# Summary statistics of all calls per sample
calls_per_sample_df <- sp_scores %>%
group_by(sample) %>%
summarise(n=n())
summary_calls <- summary(calls_per_sample_df$n)
summary_calls
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.00 22.00 22.67 28.00 47.00
CHRPscores<-read.table("data/pcawg_chromothripsis_counts_high_scores_95.txt",header=T,sep="\t",stringsAsFactors = F)
CHRPscores<-CHRPscores[CHRPscores$sample%in%colnames(sig_pat_mat_pcawg),]
pdat<-reshape2::melt(sig_pat_mat_pcawg)
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,CHRPscores,by.x=2,by.y=1)
pdat<-pdat[pdat$n>0,]
cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, n,method="p",exact=F)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, n,method="p",exact=F)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Shatterproof score"
pvals$Feature<-"Shatterproof score"
integrated_matrix<-cbind("Shatterproof score",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])
pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(n)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = 8,scales="free_x")+
coord_cartesian(ylim = c(0,10))+
geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=7.5, hjust=-0.05, vjust=0,parse = T)+
geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=8.5, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
strip.text.x = element_text(size = 7))
p
all_cor_dat<-rbind(all_cor_dat,cbind("Shatterproof score",pdat$Signature,pdat$Exposure,pdat$n))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)
Telomere lengths of deep WGS samples were estimated using the Telomerecat software http://telomerecat.readthedocs.org/.
Telomerecat was installed using the python installer:
pip install telomerecat
Once installed, the following command was used to output a length estimate for a given BAM file:
telomerecat bam2length /path/to/bamfile.bam
We adjusted the observed tumour sample telomere length to account for ploidy: adj_length=length*(1/purity). We then calculated the spearman rank correlation for all samples with exposures in a signature greater than 10%.
library(ppcor)
telo<-read.table("data/telomere_length_48_tumor.csv",sep=",",header=T,stringsAsFactors = F)
ID<-gsub("^.*?JBLAB","JBLAB",telo$Sample)
ID<-gsub("-ready.bam","",ID)
purity<-pData(all_CN[,colnames(all_CN)%in%ID])[,c("name","purity")]
purity_ploidy<-cbind(purity,getPloidy(all_CN[,colnames(all_CN)%in%ID]))
patient<-unlist(lapply(strsplit(telo$Sample,"_"),"[[",1))
telo<-data.frame(ID,patient,length=telo$Length,stringsAsFactors = F)
colnames(purity_ploidy)<-c("name","purity","ploidy")
telo<-merge(telo,purity_ploidy,by.x=1,by.y=1)
telo<-cbind(telo,age=surv_dat[match(telo$patient,surv_dat$TRIALNO),"AGE"])
telo<-telo[!is.na(telo$age),]
pdat<-reshape2::melt(cbind(sig_pat_mat_britroc,sig_pat_mat_remain))
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,telo,by.x=2,by.y=1)
cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = pcor.test(Exposure, length,c(purity,age),method="s")$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = pcor.test(Exposure, length,c(purity,age),method="s")$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Telomere length"
pvals$Feature<-"Telomere length"
integrated_matrix<-cbind("Telomere length",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.1&!is.na(integrated_matrix$qvalue<0.1),])
pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(length)/1000))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = nsig,scales="free_x")+
coord_cartesian(ylim = c(0,12))+
geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=10, hjust=-0.05, vjust=0,parse = T)+
geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=8, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
strip.text.x = element_blank(),strip.background = element_blank())
p
pvals$pval<-p.adjust(pvals$pval,method="BH")
all_cor_dat<-rbind(all_cor_dat,cbind("Telomere length",pdat$Signature,pdat$Exposure,pdat$length))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)
This file describes how to calculate the proportion of amplification associated fold-back inversions (ampFBI) from head-to-head inversion (h2hINV) type structural variants and copy-number information. For each sample, the no. of h2hINVs found in a 200kbp region centred around an amplified copy-number segment (CN >=5) was calculated. The proportion of ampFBI was defined as the no. of such h2hINVs relative to the total no. of structural variants in the sample. The X chromosome was left out of analyses.
suppressMessages(library(Biobase))
suppressMessages(library(GenomicRanges))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
path <- "restricted_data/pcawg_sv/"
ext <- ".pcawg_consensus_1.6.161116.somatic.sv.bedpe.gz"
pcawg_ov_filelist <- paste0(path, pcawg_ids, ext)
# Select ids for which both structural variant and copy-number information exist
ids <- pcawg_ids[pcawg_ids %in% names(pcawg_segTabs)]
segTable_colnames <- colnames(pcawg_segTabs[[1]])
ampFBI <- list()
for(sample_name in ids){
# the no. of amplification-associated fold-back inversions in each sample
n_amplifiedFBI <- 0
file_name <- paste0(filepath,sample_name,ext)
sv <- read.table(file_name,header=T, sep ="\t", as.is = T)
sv <- filter(sv, chrom1 != "X")
totalSVs <- dim(sv)[1]
h2h_count <- filter(sv, svclass == "h2hINV") %>%
summarise(n=n()) %>%
.$n
if(h2h_count>0){
h2h_df <- filter(sv, svclass == "h2hINV") %>%
select(chrom1,start1,end2) %>%
mutate(start1 = start1-1e5, end2 = end2+1e5) %>% # add 100KB to either side
rename(chr=chrom1, start=start1, end=end2)
h2h_granges <- makeGRangesFromDataFrame(h2h_df)
seg_granges <- makeGRangesFromDataFrame(pcawg_segTabs[sample_name], keep.extra.columns = T)
# identify segments overlapping h2hINVs
hits <- findOverlaps(h2h_granges,seg_granges)
# filter hits where copy-number >= 5
CN <- as.numeric(elementMetadata(seg_granges )[,1])
highCNhits <- subjectHits(hits)[subjectHits(hits) %in% which(CN >=5)]
if(length(highCNhits)>0){
subset_hits <- hits[subjectHits(hits) %in% highCNhits]
qHits <- queryHits(subset_hits)
# count once if h2hINV overlaps multiple segments
n_amplifiedFBI <- length(unique(qHits))/totalSVs
}
}
ampFBI[[sample_name]] <- n_amplifiedFBI
}
ampFBI.df <- data.frame(names(ampFBI),unlist(ampFBI), stringsAsFactors = F)
colnames(ampFBI.df) <- c("ID", "amp_FBI")
write.table(ampFBI.df, "data/pcawg_amplified_FBI_fraction.txt", sep="\t", quote=F, row.names = F)
fbi<-read.table("data/pcawg_amplified_FBI_fraction.txt",sep="\t",header=T,stringsAsFactors = F)
pdat<-reshape2::melt(sig_pat_mat_pcawg)
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,fbi,by.x=2,by.y=1,all.x=T)
pdat<-pdat[!is.na(pdat$amp_FBI),]
pdat<-pdat[!pdat$amp_FBI==0,]
cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, amp_FBI,method="p",exact=T)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, amp_FBI,method="p",exact=T)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Amp FBI"
pvals$Feature<-"Amp FBI"
pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(amp_FBI)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = nsig,scales="free_x")+
coord_cartesian(ylim = c(0,0.1))+
geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=0.075, hjust=-0.05, vjust=0,parse = T)+
geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=0.085, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
strip.text.x = element_blank(),strip.background = element_blank())
p
integrated_matrix<-cbind("Amplification associated fold-back inversions",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])
pvals$pval<-p.adjust(pvals$pval,method="BH")
all_cor_dat<-rbind(all_cor_dat,cbind("Amp FBI",pdat$Signature,pdat$Exposure,pdat$amp_FBI))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)
age<-c(pcawg_surv_dat$donor_age_at_diagnosis,tcga_clin$age_at_initial_pathologic_diagnosis)
age<-data.frame(ID=c(rownames(pcawg_surv_dat),rownames(tcga_clin)),age,stringsAsFactors = F)
pdat<-reshape2::melt(cbind(sig_pat_mat_pcawg,sig_pat_mat_tcga))
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,age,by.x=2,by.y=1)
cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, age,method="p",exact=T)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, age,method="p",exact=T)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Age diagnosis"
pvals$Feature<-"Age diagnosis"
pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
ggplot(pdat,aes(x=Exposure,y=as.numeric(age)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = nsig,scales="free_x")+
coord_cartesian(ylim = c(0,100))+
geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=-Inf, y=15, hjust=-0.05, vjust=0,parse = T)+
geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=-Inf, y=5, hjust=-0.05, vjust=0,parse=T)+ ylab("Age at diagnosis")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=5),
strip.text.x = element_text(size = 5))
cors$Signature<-as.character(1:nsig)
integrated_matrix<-cbind("Age",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])
pvals$pval<-p.adjust(pvals$pval,method="BH")
all_cor_dat<-rbind(all_cor_dat,cbind("Age diagnosis",pdat$Signature,pdat$Exposure,pdat$age))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)
#this chunk of code is not run be default due to data restrictions.
#to run this code please download hrd_germline_w_CN_exposures.txt into the restricted_data directory and ensure previous restricted data code chunks are run
brca_cases<-read.table("restricted_data/hrd_germline_w_CN_exposures.txt",header=T,stringsAsFactors = F)
brca_loc<-gene_coords[gene_coords$gene=="BRCA1"|gene_coords$gene=="BRCA2",]
BRCAmuts<-HRmuts[(!HRmuts$Type=="nonsynonymous")&(!HRmuts$Gene=="WT"),]
BRCAmuts<-BRCAmuts[!is.na(BRCAmuts$Gene),]
BRCAmuts<-merge(brca_cases,BRCAmuts,by.x=1,by.y=1)
res<-c()
for(j in 1:nrow(BRCAmuts))
{
name<-BRCAmuts[j,"Gene"]
chr<-brca_loc[brca_loc$gene==name,"chr"]
tss<-abs(as.numeric(brca_loc[brca_loc$gene==name,"tss"]))
segTable<-all_segTabs[[BRCAmuts[j,"ID"]]]
cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<tss&as.numeric(segTable[,3])>tss,4])
if(!length(cn)==1)
{
modtss<-tss+1000000
cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<modtss&as.numeric(segTable[,3])>modtss,4])
if(!length(cn)==1)
{
cn<-NA
}
}
res<-c(res,cn)
}
BRCAmuts$cn<-res
BRCAmuts$LOH<-res<1.4
BRCAmutloh<-BRCAmuts[BRCAmuts$LOH&BRCAmuts$Gene=="BRCA2","ID"]
pdat_sig<-reshape2::melt(sig_pat_mat_britroc_all)
colnames(pdat_sig)<-c("Signature","Patient","Exposure")
BRCAmutloh<-c("JBLAB-4195","JBLAB-4127","JBLAB-4162","IM_160")
pdat_brca<-pdat_sig[pdat_sig$Patient%in%BRCAmutloh,]
pdat_brca$Patient<-factor(pdat_brca$Patient,levels=c("JBLAB-4195","JBLAB-4127","JBLAB-4162","IM_160"))
pdat_brca$Signature<-plyr::revalue(pdat_brca$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat_brca,aes(x=Patient,y=Exposure,fill=Signature))+geom_bar(stat="identity")+theme_bw()+
scale_fill_manual( values = cbPalette)+
theme(axis.text.x = element_blank(),axis.ticks.x=element_blank(),
axis.text=element_text(size=7),axis.title=element_text(size=7),
legend.text = element_text(size = 7),legend.title=element_text(size=7),
legend.position='right',
plot.margin = margin(0, 0, 0, 0, "cm"))+ylim(c(0,1))+xlab("BRCA2 germline mutation carriers + somatic LOH (n=4)")+
ylab("Exposures")
p
Here we used cases with archival samples (pre-treatment) and matched relapse samples (post-treatment) to explore whether treatment affected copy-number signatures, or if copy-number signatures could predict response.
#this code chuck is not run by default due to data restrictions. Please obtain the reg.csv file from the Britroc DACO and put it in the restricted_data directory to run
#extract exposures for cases with paried samples
clin<-read.table("restricted_data/reg.csv",sep=",",header=T,stringsAsFactors = F,quote="")
clin<-clin[,c("TRIALNO","PLATINUM")]
clin$PLATINUM<-plyr::revalue(as.character(clin$PLATINUM),c("1"="Sensitive","2"="Resistant"))
priorlines<-read.table("data/prior_lines.csv",sep=",",header=T,stringsAsFactors = F,quote="")
samp_annotation_clin<-merge(samp_annotation,clin,by.x=1,by.y=1,all.x=T)
samp_annotation_clin<-merge(samp_annotation_clin,priorlines,by.x=1,by.y=1,all.x=T)
samples<-samp_annotation_clin[samp_annotation_clin$IM.JBLAB_ID%in%colnames(sig_pat_mat_britroc_all),c(1,2,8,9,10)]
samples<-samples %>%
mutate(sample=ifelse(grepl("IM",IM.JBLAB_ID),"P","R")) %>%
arrange(Britroc_No,desc(star_rating))%>%
group_by(sample) %>% distinct(Britroc_No,.keep_all = TRUE)
samples[samples$IM.JBLAB_ID=="IM_91"|samples$IM.JBLAB_ID=="IM_70","sample"]<-"R"
samples[samples$sample=="P","PriorLinesChemo"]<-0
samples[is.na(samples$PriorLinesChemo),"PriorLinesChemo"]<-0
paired_samples<-samples %>% group_by(Britroc_No) %>% filter(n()>1) %>% as.data.frame
paired_dat<-reshape2::melt(sig_pat_mat_britroc_all[,colnames(sig_pat_mat_britroc_all)%in%paired_samples$IM.JBLAB_ID])
paired_dat<-merge(paired_dat,paired_samples,by.x=2,by.y=2)
colnames(paired_dat)<-c("ID","Signature","Exposure","Britroc_No","star_rating","status","prior_chemo","sample")
saveRDS(paired_dat,"data/paired_sample_details.rds")
Use a linear model to test if signature exposures change in repsonse to treatment:
library(broom)
paired_dat<-readRDS("data/paired_sample_details.rds")
curr_dat<-filter(paired_dat,Britroc_No!=45&Britroc_No!=32)
curr_dat<-tidyr::spread(curr_dat[,c(2,3,4,6,8)],key="sample",value="Exposure")
prepost_dat<-merge(curr_dat,surv_dat[,c("TRIALNO","AGE","PFS","OS")],by.x=2,by.y=1)
prepost_dat<-merge(prepost_dat,unique(paired_dat[paired_dat$sample=="R",c(4,7)]),by.x=1,by.y=1)
prepost_dat$status<-factor(prepost_dat$status,levels=c("Sensitive","Resistant"))
prepost_dat$change=prepost_dat$R-prepost_dat$P
prepost_dat$Signature=factor(prepost_dat$Signature,levels=paste0("s",c(5,1:4,6:8)))
#centre AGE
prepost_dat$AGE2<-(prepost_dat$AGE-mean(prepost_dat$AGE))/sqrt(var(prepost_dat$AGE))
colnames(prepost_dat)<-c("Britroc_No","Signature","status","diagnosis_exposure","relapse_exposure",
"AGE","time_to_relapse","overall_survival","prior_lines_chemo","change_in_exposure","centered_AGE")
#signature changes from diagnosis to relapse taking into account prior exposure, time, and age
diff_sig_time<-prepost_dat %>%
group_by(Signature) %>%
do(model=lm(log(relapse_exposure+1)~log(diagnosis_exposure+1)+time_to_relapse+centered_AGE+prior_lines_chemo,data=.)) %>%
tidy(model) %>% mutate(pval.adj = p.adjust (p.value, method='BH'))
knitr::kable(diff_sig_time%>%
filter(pval.adj<0.05)) %>%
kableExtra::kable_styling(full_width = T)
| Signature | term | estimate | std.error | statistic | p.value | pval.adj |
|---|---|---|---|---|---|---|
| s1 | log(diagnosis_exposure + 1) | 0.7257877 | 0.1539868 | 4.713311 | 0.0000607 | 0.0003036 |
| s2 | log(diagnosis_exposure + 1) | 0.4980334 | 0.1685788 | 2.954305 | 0.0062877 | 0.0314385 |
| s3 | log(diagnosis_exposure + 1) | 0.6304988 | 0.1528667 | 4.124500 | 0.0003007 | 0.0015034 |
| s4 | log(diagnosis_exposure + 1) | 0.5691329 | 0.1360120 | 4.184431 | 0.0002558 | 0.0012790 |
| s6 | log(diagnosis_exposure + 1) | 0.4897941 | 0.1575265 | 3.109281 | 0.0042791 | 0.0213956 |
| s7 | log(diagnosis_exposure + 1) | 0.5559967 | 0.1659289 | 3.350813 | 0.0023182 | 0.0115908 |
Use a generalised linear model to test if signature exposures at diagnosis predict response:
#do signature predict response at diagnosis
diff_sig_response<-prepost_dat %>%
group_by(Signature) %>%
do(model=glm(I(status=="Sensitive")~diagnosis_exposure+centered_AGE,data=.,family="binomial")) %>%
tidy(model) %>%mutate(pval.adj = p.adjust (p.value, method='BH'))
knitr::kable(diff_sig_response%>%
filter(pval.adj<0.05)) %>%
kableExtra::kable_styling(full_width = T)
| Signature | term | estimate | std.error | statistic | p.value | pval.adj |
|---|---|---|---|---|---|---|
| s1 | (Intercept) | 3.985971 | 1.3545777 | 2.942593 | 0.0032548 | 0.0097643 |
| s1 | diagnosis_exposure | -7.521767 | 3.1208918 | -2.410134 | 0.0159467 | 0.0239200 |
| s6 | (Intercept) | 2.078200 | 0.7438572 | 2.793815 | 0.0052090 | 0.0156271 |